Job Title
Senior Networking Solution Test Engineer - AI Cluster Debugging
Role Summary
Join the End‑to‑End Verification team to validate and debug Ethernet‑based AI clusters. The role owns complex, cross‑stack issues spanning hardware, system software and AI workloads and delivers reproducible fixes and actionable test suites.
Experience Level
Senior: requires 10+ years of hands‑on networking or system‑level testing and debugging on Linux.
Responsibilities
Primary responsibilities include building realistic testbeds, reproducing and triaging cluster issues, and collaborating with development teams to drive fixes.
- Design and review test and product requirements for Ethernet/NIC/DPU/switch behavior in large AI clusters.
- Build and maintain heterogeneous, customer‑like test environments (hardware, OS/driver combinations, network fabrics).
- Reproduce customer scenarios and perform end‑to‑end troubleshooting to identify the root cause and track fixes.
- Inspect source code to identify defects, validate fixes, and improve logging and instrumentation.
- Debug NCCL, RoCE/RDMA and related networking components via logs, code inspection and targeted experiments.
- Define tests and guide automation to produce robust suites with actionable logs, metrics and traces.
- Run regression, performance, functional and scale testing; analyze results and report findings to stakeholders.
- Profile and benchmark deep learning training and inference workloads; correlate model metrics with system and network telemetry to find bottlenecks.
Requirements
Must‑have skills and experience for hands‑on cluster debugging and test ownership:
- 10+ years of hands‑on networking or system‑level testing and debugging on Linux.
- Strong Linux networking and debugging skills (e.g. perf, tcpdump, ethtool, iproute2).
- Proven production‑grade debugging: hypothesis formation, experiments, and driving issues to root cause under pressure.
- Expertise in host‑side NIC validation and tuning (offloads, queues, interrupts, firmware/driver interactions).
- Knowledge of AI networking libraries and protocols (NCCL, RoCE, RDMA) including performance and correctness debugging.
- Ability to read and reason about source code (C/C++/Python or similar) and collaborate with developers on fixes.
- Scripting and automation skills (Bash, Python, Ansible) for setup, log collection and experiment orchestration.
- Strong analytical and communication skills, an ownership mindset, and the ability to work collaboratively.
Nice‑to‑have:
- Hands‑on debugging of collective communication libraries (e.g. NCCL) or large‑scale LLM training/inference clusters.
- Experience with very large cluster environments (tens to thousands of GPUs/nodes), incident response and post‑mortem analysis.
- Deep expertise tuning and debugging congestion control and lossless Ethernet for AI workloads (DCQCN, ECN, PFC).
- Familiarity with NVIDIA networking hardware and software (BlueField/BF3, ConnectX) and diagnostics.
- Experience debugging issues spanning multiple layers (L2/L3, transport, AI frameworks) or contributing to open‑source networking/AI systems.
Education Requirements
B.A./B.Sc. in Computer Science, Electrical Engineering, or a related field; equivalent practical IT/network/systems experience is accepted in lieu of a degree.
About the Company
Company: NVIDIA
Headquarters: Santa Clara, California, USA
NVIDIA is a global leader in accelerated computing, renowned for its innovative solutions in AI and digital twins that transform diverse industries. The company specializes in networking technologies, providing end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-06-30