Job Title
Senior RAS and Power Management Firmware Architect
Role Summary
Technical leader responsible for defining and driving firmware architecture for reliability, availability, and serviceability (RAS) and for power management across NVIDIA networking products and platforms. Works hands-on with hardware, firmware, software, validation, customer engineering, and external partners to deliver diagnosable, power-efficient systems for large-scale deployments.
Experience Level
Senior: 7+ years of relevant experience in firmware, platform architecture, embedded systems, or low-level systems software.
Responsibilities
Design and guide platform-level firmware features for RAS and power management; lead cross-functional architecture and validation activities.
- Define firmware architecture for RAS, error handling, containment, recovery, escalation, and reporting.
- Specify firmware behavior for power sequencing, power states, reset flows, thermal and power-fault handling, idle management, and recovery flows.
- Create specifications for hardware error handling, health monitoring, crash capture, telemetry, diagnostics, and field serviceability.
- Define interfaces and contracts between firmware, hardware, OS, BMCs/management controllers, platform software, and cloud/service infrastructure.
- Drive architecture reviews, tradeoff analysis, failure-mode analysis, and validation strategy for RAS and power management features.
- Establish standards for error logs, event schemas, telemetry flows, recovery policies, service diagnostics, and production debug infrastructure.
- Guide teams through implementation, silicon bring-up, platform integration, validation, and production deployment.
- Analyze customer and field failures, identify architectural gaps, and incorporate lessons learned into platform roadmaps.
Requirements
Must-have technical skills and proven experience to lead firmware architecture for complex platforms.
- 7+ years designing or implementing firmware/platform architecture, embedded systems, or low-level systems software.
- Deep understanding of RAS principles: fault modeling, error containment, recovery policies, diagnosability, and serviceability.
- Experience architecting firmware for complex platforms such as SoCs, accelerators, DPUs, servers, networking devices, or embedded systems.
- Strong knowledge of power management: sequencing, reset architecture, thermal/power fault handling, power state transitions, and recovery flows.
- Familiarity with boot firmware, UEFI/BIOS, BMCs, embedded controllers, RTOS, embedded Linux, or platform management stacks.
- Solid understanding of hardware/software interfaces: registers, interrupts, telemetry paths, debug infrastructure, and firmware-to-hardware contracts.
- Proficiency with programming and debugging fundamentals: C/C++, scripting (Python or Perl), and RTL or low-level languages such as Verilog or assembly (including RISC-V assembly).
- Ability to lead cross-functional architecture discussions and drive alignment across hardware, firmware, software, validation, product, and customer teams; strong communication and technical leadership.
Nice-to-have:
- Experience with PCIe AER, CXL RAS, memory RAS, ECC/parity, accelerator or networking RAS, or high-availability systems.
- Familiarity with ACPI, SMBIOS, UEFI platform standards, PLDM, MCTP, Redfish, IPMI, or cloud telemetry systems.
- Experience with power/thermal fault handling, dynamic power management, low-power states, autonomous recovery mechanisms, silicon bring-up, platform validation, or production diagnostics.
- Prior technical leadership as a firmware architect, principal engineer, platform lead, or domain owner.
Education Requirements
BSc, MSc, or PhD in Electrical Engineering, Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience.
About the Company
Company: NVIDIA
Headquarters: Santa Clara, California, USA
NVIDIA is a global leader in accelerated computing, renowned for its innovative solutions in AI and digital twins that transform diverse industries. The company specializes in networking technologies, providing end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-06-03