Senior Software Engineer - NVLink Rack-Scale Stability and Reliability

NVIDIA

May 23, 2026

Full-time

On-site

Santa Clara, California, United States

$152,000 - $287,500 USD yearly

Test Engineering Jobs, Level - Senior

Job Title

Senior Software Engineer - NVLink Rack-Scale Stability and Reliability

Role Summary

Join the Fabric Networking team to improve stability and reliability for NVLink- and NVSwitch-based rack-scale GPU systems. The role focuses on platform bringup, system validation, diagnostics, and making first-of-their-kind platforms production-ready at datacenter scale.

You will work across hardware, firmware, software, and customer teams to deliver resilient, debuggable systems and operational tooling for large-scale AI infrastructure.

Experience Level

Senior; the role expects 5+ years of relevant experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.

Responsibilities

Core responsibilities include validation, debugging, reliability engineering, and tooling for rack-scale systems.

Drive platform bringup, feature enablement, end-to-end software validation, and debug for NVLink-based GPU and rack-scale systems.
Develop tools, diagnostics, automation, and infrastructure for validation, regression testing, and fleet support.
Lead reliability and MTBI validation via stress testing, telemetry analysis, and failure injection.
Triage complex issues across software, firmware, networking, and hardware in validation, deployment, and production.
Collaborate with architecture, hardware, firmware, software, and customer teams to improve system quality.
Build and maintain SRE-style validation infrastructure including provisioning and monitoring.
Create automation, dashboards, runbooks, and debug workflows to speed root-cause analysis and operations.

Requirements

Successful candidates must have the following technical skills and experience.

5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
Strong programming skills in C/C++ and Python; Bash/Shell scripting is a plus.
Proven system-level debugging across software, firmware, hardware, and networking layers.
Solid networking fundamentals: TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
Experience with platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging for large-scale AI systems.
Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods.
Strong communication and collaboration skills for working with engineering, customer, and operations teams.

Preferred / nice-to-have:

Experience with NVIDIA GPU systems, NVLink, NVSwitch, CUDA, and large-scale AI/HPC clusters.
Familiarity with PCIe, memory hierarchy, DMA, high-speed interconnects, distributed training/inference systems, server management technologies, and fleet monitoring.
Experience building diagnostics, automation, CI/CD pipelines, dashboards, and reliability tooling.

Education Requirements

BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent practical experience.

About the Company

Company: NVIDIA

Headquarters: Santa Clara, California, USA

NVIDIA is a global leader in accelerated computing, renowned for its innovative solutions in AI and digital twins that transform diverse industries. The company specializes in networking technologies, providing end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-05-23
