Job Title
System Software Engineer – Data Center GPU Compute Diagnostics
Role Summary
Work on diagnostics and stress workloads for rack-scale Data Center GPUs used in AI supercomputers. The team develops low-level tests and tools for silicon bring-up, validation, manufacturing, and field triage.
The engineer will partner with senior technical leads to own scoped areas of CUDA/GEMM diagnostics, collaborate with hardware architecture, silicon validation, manufacturing, and field teams, and contribute to system-level characterization.
Experience Level
Mid-level: the role expects 5+ years of relevant experience in system software, GPU software, embedded software, or hardware validation.
Responsibilities
Key responsibilities include developing diagnostic workloads, validating hardware, and supporting cross-team debugging and productization.
- Collaborate with hardware architecture, driver, manufacturing, and field teams across the product development lifecycle.
- Implement and maintain CUDA/C++ diagnostic workloads and associated software infrastructure.
- Write and tune GPU compute tests that stress Tensor Cores, SMs, L2 cache, HBM memory, and power/thermal operating points.
- Implement and tune GEMM-style diagnostics and tests that exercise NVLink, PCIe, and CPU subsystems.
- Contribute to higher-level AI workload tests (e.g., PyTorch-based large-model workloads) for realistic rack-scale scenarios.
- Bring up and validate new hardware features with pre-beta drivers, low-level diagnostics, and system telemetry.
- Triage and debug failures involving ECC, HBM, thermal limits, voltage/frequency margining, and interconnect errors.
Requirements
Must-have technical skills, experience, and behaviors.
- 5+ years of system software, GPU software, embedded software, or hardware validation experience.
- Experience writing low-level diagnostics and using device firmware and hardware-level debuggers.
- Strong C/C++ and Python programming skills.
- Exposure to GPU architecture, CUDA kernels, GEMM-style workloads, or accelerator programming.
- Working knowledge of memory systems, ECC behavior, and DMA engines.
- Familiarity with voltage/frequency characterization, thermal testing, power stress, and silicon validation concepts (e.g., Vmin/Fmax, P-state testing).
- Experience using modern AI development and analysis tools to accelerate code development, debugging, and test creation.
- Strong problem-solving and low-level debugging skills, and the ability to work cross-functionally.
Education Requirements
BS or MS in Electrical Engineering, Computer Engineering, Computer Science, or a related technical field — or equivalent practical experience.
About the Company
Company: NVIDIA
Headquarters: Santa Clara, California, USA
NVIDIA is a global leader in accelerated computing, known for AI and digital twin solutions that transform diverse industries. Its portfolio includes networking technologies, with end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-05-21