System Software Engineer – Data Center GPU Compute Diagnostics

NVIDIA
May 21, 2026
Full-time
On-site
Durham, North Carolina, United States
$152,000 - $241,500 USD yearly
Test Engineering Jobs, Level - Mid-Career

Role Summary

Work on diagnostics and stress workloads for rack-scale Data Center GPUs used in AI supercomputers. The team develops low-level tests and tools for silicon bring-up, validation, manufacturing, and field triage.

The engineer will partner with senior technical leads to own scoped areas of CUDA/GEMM diagnostics; collaborate with hardware architecture, silicon validation, manufacturing, and field teams; and contribute to system-level characterization.

Experience Level

Mid-level. The role expects 5+ years of relevant experience in system software, GPU software, embedded software, or hardware validation.

Responsibilities

Key responsibilities include developing diagnostic workloads, validating hardware, and supporting cross-team debugging and productization.

  • Collaborate with hardware architecture, driver, manufacturing, and field teams across the product development lifecycle.
  • Implement and maintain CUDA/C++ diagnostic workloads and associated software infrastructure.
  • Write and tune GPU compute tests that stress Tensor Cores, SMs, L2 cache, HBM, and power/thermal operating points.
  • Implement and tune GEMM-style diagnostics and tests that exercise NVLink, PCIe, and CPU subsystems.
  • Contribute to higher-level AI workload tests (e.g., PyTorch-based large-model workloads) for realistic rack-scale scenarios.
  • Bring up and validate new hardware features with pre-beta drivers, low-level diagnostics, and system telemetry.
  • Triage and debug failures involving ECC, HBM, thermal limits, voltage/frequency margining, and interconnect errors.

Requirements

Must-have technical skills, experience, and behaviors.

  • 5+ years of system software, GPU software, embedded software, or hardware validation experience.
  • Experience writing low-level diagnostics and using device firmware and hardware-level debuggers.
  • Strong C/C++ and Python programming skills.
  • Exposure to GPU architecture, CUDA kernels, GEMM-style workloads, or accelerator programming.
  • Working knowledge of memory systems, ECC behavior, and DMA engines.
  • Familiarity with voltage/frequency characterization, thermal testing, power stress, and silicon validation concepts (e.g., Vmin/Fmax, P-state testing).
  • Experience using modern AI development and analysis tools to accelerate code development, debugging, and test creation.
  • Strong problem-solving and low-level debugging skills, and the ability to work cross-functionally.

Education Requirements

BS or MS in Electrical Engineering, Computer Engineering, Computer Science, or a related technical field, or equivalent practical experience.


About the Company

Company: NVIDIA

Headquarters: Santa Clara, California, USA

NVIDIA is a global leader in accelerated computing, renowned for its innovative solutions in AI and digital twins that transform diverse industries. The company specializes in networking technologies, providing end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-05-21