Job Title
Principal Software Engineer, GPU Firmware and GPU System Software — CSP Engagements
Role Summary
Technical lead focused on GPU firmware and GPU system software for cloud service provider (CSP) and hyperscale customers. Act as the primary NVIDIA technical contact to ensure CSP teams can manage, update, and operate GPU firmware reliably at fleet scale.
You will coordinate cross-CSP work streams, gather operational feedback to influence NVIDIA's firmware/software roadmap, and ensure customer automation, update sequencing, and recovery procedures are validated before releases.
Experience Level
Senior / Principal level, with 15+ years of relevant experience.
Responsibilities
Work directly with CSP engineering teams to integrate, operate, and improve GPU firmware and system software at scale.
- Lead GPU firmware and software work streams with CSP engineering teams; explain firmware architecture, update sequencing, recovery procedures, and power management.
- Collect and synthesize customer requirements on manageability, observability, security, and performance; represent these priorities in NVIDIA's roadmap.
- Design and validate firmware update orchestration for large fleets: multi-GPU sequencing, rollback, staged rollouts, failure handling, and update validation.
- Serve as the technical interface between NVIDIA and CSP firmware/software engineers; document behaviors and integration points for customers.
- Identify cross-CSP patterns in firmware/software issues and drive documentation, tooling, and test strategy improvements.
Requirements
Must-have technical skills and experience:
- 15+ years of experience in GPU system software, GPU firmware, or accelerator platform engineering.
- Deep understanding of GPU architecture internals: streaming multiprocessors, compute kernels, memory hierarchy, and firmware/driver performance interactions.
- Knowledge of multi-GPU fabric architectures (e.g., NVLink) and firmware coordination across rack-scale systems.
- Familiarity with GPU firmware components: VBIOS, GPU microcontroller firmware, InfoROM, and their interaction with driver stacks.
- Proven experience with firmware update lifecycle management at scale: A/B updates, multi-device sequencing, rollback, staged rollout, and emergency recovery.
- Understanding of GPU error handling and recovery flows and how firmware-level errors surface to applications.
- Experience with GPU health monitoring and telemetry (Xid errors, thermal/power events, ECC counters) and interpreting their operational significance.
- Strong customer-facing and influencing skills to align engineering teams on quality and fleet manageability improvements.
Nice-to-have:
- Direct experience with NVIDIA GPU VBIOS, GPU microcontroller firmware, or driver internals.
- Background in managing GPU fleets at 10K+ GPU scale, including rollout and remediation strategies.
- Experience building error taxonomies and operational runbooks for GPU firmware behavior.
- Knowledge of GPU security (secure boot, code signing, attestation) and firmware-level multi-tenancy isolation.
- Familiarity with GPU power management architecture and its impact on workload performance.
Education Requirements
Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent practical experience.
About the Company
Company: NVIDIA
Headquarters: Santa Clara, California, USA
NVIDIA is a global leader in accelerated computing, renowned for its innovative solutions in AI and digital twins that transform diverse industries. The company specializes in networking technologies, providing end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-06-26