System Software Engineer – Data Center GPU Compute Diagnostics

NVIDIA

May 21, 2026

Full-time

On-site

Durham, North Carolina, United States

$152,000 - $241,500 USD yearly

Test Engineering Jobs, Level - Mid-Career

Role Summary

Work on diagnostics and stress workloads for rack-scale Data Center GPUs used in AI supercomputers. The team develops low-level tests and tools for silicon bring-up, validation, manufacturing, and field triage.

The engineer will partner with senior technical leads to own scoped areas of CUDA/GEMM diagnostics, collaborate with hardware architecture, silicon validation, manufacturing and field teams, and contribute to system-level characterization.

Experience Level

Mid-level. The role expects 5+ years of relevant experience in system software, GPU software, embedded software, or hardware validation.

Responsibilities

Key responsibilities include developing diagnostic workloads, validating hardware, and supporting cross-team debugging and productization.

Collaborate with hardware architecture, driver, manufacturing, and field teams across the product development lifecycle.
Implement and maintain CUDA/C++ diagnostic workloads and associated software infrastructure.
Write and tune GPU compute tests that stress Tensor Cores, SMs, L2/cache, HBM memory, and power/thermal operating points.
Implement and tune GEMM-style diagnostics and tests that exercise NVLink, PCIe, and CPU subsystems.
Contribute to higher-level AI workload tests (e.g., PyTorch-based large-model workloads) for realistic rack-scale scenarios.
Bring up and validate new hardware features with pre-beta drivers, low-level diagnostics, and system telemetry.
Triage and debug failures involving ECC, HBM, thermal limits, voltage/frequency margining, and interconnect errors.

Requirements

Must-have technical skills, experience, and behaviors.

5+ years of system software, GPU software, embedded software, or hardware validation experience.
Experience writing low-level diagnostics and using device firmware and hardware-level debuggers.
Strong C/C++ and Python programming skills.
Exposure to GPU architecture, CUDA kernels, GEMM-style workloads, or accelerator programming.
Working knowledge of memory systems, ECC behavior, and DMA engines.
Familiarity with voltage/frequency characterization, thermal testing, power stress, and silicon validation concepts (e.g., Vmin/Fmax, P-state testing).
Experience using modern AI development and analysis tools to accelerate code development, debugging, and test creation.
Strong problem-solving and low-level debugging skills, and the ability to work cross-functionally.

Education Requirements

BS or MS in Electrical Engineering, Computer Engineering, Computer Science, or a related technical field, or equivalent practical experience.

About the Company

Company: NVIDIA

Headquarters: Santa Clara, California, USA

NVIDIA is a global leader in accelerated computing, renowned for its innovative solutions in AI and digital twins that transform diverse industries. The company specializes in networking technologies, providing end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-05-21
