Job Title
Senior Software Engineer - NVLink Rack-Scale Stability and Reliability
Role Summary
Join the Fabric Networking team to improve stability and reliability for NVLink- and NVSwitch-based rack-scale GPU systems. The role focuses on platform bringup, system validation, diagnostics, and making first-of-their-kind platforms production-ready at datacenter scale.
You will work across hardware, firmware, software, and customer teams to deliver resilient, debuggable systems and operational tooling for large-scale AI infrastructure.
Experience Level
Senior; the role expects 5+ years of relevant experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
Responsibilities
Core responsibilities include validation, debugging, reliability engineering, and tooling for rack-scale systems.
- Drive platform bringup, feature enablement, end-to-end software validation, and debug for NVLink-based GPU and rack-scale systems.
- Develop tools, diagnostics, automation, and infrastructure for validation, regression testing, and fleet support.
- Lead reliability and mean time between interruptions (MTBI) validation via stress testing, telemetry analysis, and failure injection.
- Triage complex issues across software, firmware, networking, and hardware in validation, deployment, and production.
- Collaborate with architecture, hardware, firmware, software, and customer teams to improve system quality.
- Build and maintain SRE-style validation infrastructure, including provisioning and monitoring.
- Create automation, dashboards, runbooks, and debug workflows to speed root-cause analysis and operations.
Requirements
Successful candidates must have the following technical skills and experience.
- 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
- Strong programming skills in C/C++ and Python; Bash/Shell scripting is a plus.
- Proven system-level debugging experience across software, firmware, hardware, and networking layers.
- Solid networking fundamentals: TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
- Experience with platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging for large-scale AI systems.
- Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods.
- Strong communication and collaboration skills with engineering, customer, and operations teams.
Preferred / nice-to-have:
- Experience with NVIDIA GPU systems, NVLink, NVSwitch, CUDA, and large-scale AI/HPC clusters.
- Familiarity with PCIe, memory hierarchy, DMA, high-speed interconnects, distributed training/inference systems, server management technologies, and fleet monitoring.
- Experience building diagnostics, automation, CI/CD pipelines, dashboards, and reliability tooling.
Education Requirements
BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent practical experience.
About the Company
Company: NVIDIA
Headquarters: Santa Clara, California, USA
NVIDIA is a global leader in accelerated computing, renowned for its innovative solutions in AI and digital twins that transform diverse industries. The company specializes in networking technologies, providing end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-05-23