Senior Software Engineer - NVLink Rack-Scale Stability and Reliability

NVIDIA
May 23, 2026
Full-time
On-site
Santa Clara, California, United States
$152,000 - $287,500 USD yearly
Test Engineering Jobs, Level - Senior

Job Title

Senior Software Engineer - NVLink Rack-Scale Stability and Reliability

Role Summary

Join the Fabric Networking team to improve stability and reliability for NVLink- and NVSwitch-based rack-scale GPU systems. The role focuses on platform bringup, system validation, diagnostics, and making first-of-their-kind platforms production-ready at datacenter scale.

You will work across hardware, firmware, software, and customer teams to deliver resilient, debuggable systems and operational tooling for large-scale AI infrastructure.

Experience Level

Senior; the role expects 5+ years of relevant experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.

Responsibilities

Core responsibilities include validation, debugging, reliability engineering, and tooling for rack-scale systems.

  • Drive platform bringup, feature enablement, end-to-end software validation, and debug for NVLink-based GPU and rack-scale systems.
  • Develop tools, diagnostics, automation, and infrastructure for validation, regression testing, and fleet support.
  • Lead reliability and MTBI (mean time between interruptions) validation through stress testing, telemetry analysis, and failure injection.
  • Triage complex issues across software, firmware, networking, and hardware in validation, deployment, and production.
  • Collaborate with architecture, hardware, firmware, software, and customer teams to improve system quality.
  • Build and maintain SRE-style validation infrastructure including provisioning and monitoring.
  • Create automation, dashboards, runbooks, and debug workflows to speed root-cause analysis and operations.

Requirements

Must-have technical skills and experience for successful candidates.

  • 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
  • Strong programming skills in C/C++ and Python; Bash/Shell scripting is a plus.
  • Proven system-level debugging across software, firmware, hardware, and networking layers.
  • Solid networking fundamentals: TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
  • Experience with platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging for large-scale AI systems.
  • Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods.
  • Strong communication and collaboration skills for working with engineering, customer, and operations teams.

Preferred / nice-to-have:

  • Experience with NVIDIA GPU systems, NVLink, NVSwitch, CUDA, and large-scale AI/HPC clusters.
  • Familiarity with PCIe, memory hierarchy, DMA, high-speed interconnects, distributed training/inference systems, server management technologies, and fleet monitoring.
  • Experience building diagnostics, automation, CI/CD pipelines, dashboards, and reliability tooling.

Education Requirements

BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent practical experience.


About the Company

Company: NVIDIA

Headquarters: Santa Clara, California, USA

NVIDIA is a global leader in accelerated computing, renowned for its innovative solutions in AI and digital twins that transform diverse industries. The company specializes in networking technologies, providing end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-05-23