Job Title
DevOps Engineer, HPC and LSF
Role Summary
As a member of the Hardware Infrastructure Farm team, you will provide engineering and operational leadership to build and operate large-scale compute clusters that support silicon development. The role focuses on reliability, performance, automation, and engineering productivity.
Work includes system-level troubleshooting, automation of deployments and configuration, and collaboration with chip development teams to optimize infrastructure usage.
Experience Level
Mid-level: requires 3+ years of experience in large, distributed Linux environments.
Responsibilities
Primary responsibilities include operating and improving HPC infrastructure and schedulers:
- Manage and support workload and resource schedulers (e.g., IBM Spectrum LSF or SLURM) in large-scale HPC clusters.
- Develop automation for deployment, configuration management, and operational monitoring.
- Collect and analyze grid and cluster performance metrics for troubleshooting and optimization.
- Troubleshoot issues across the stack from bare metal to application level.
- Define and document standard methodologies, runbooks, and best practices for internal teams.
- Collaborate with domain experts to improve how chip development uses infrastructure.
- Contribute to reliability improvements and help reduce time to market for hardware projects.
Requirements
Must-have technical skills and experience:
- Extensive experience administering job schedulers such as IBM Spectrum LSF or SLURM.
- Proficient with CentOS/RHEL Linux administration.
- Hands-on experience with container technologies (Docker).
- Proficient in UNIX shell scripting and Python.
- 3+ years of experience operating in a large, distributed Linux environment.
- Strong problem-solving, communication, and teamwork skills.
Nice-to-have:
- Experience analyzing and tuning performance for HPC or EDA workloads.
- Familiarity with configuration management tools such as Ansible.
- Experience with Perl for maintaining legacy automation scripts.
- Deep understanding of distributed systems principles.
Education Requirements
BS in Computer Science or a similar degree, or equivalent practical experience.
About the Company
Company: NVIDIA
Headquarters: Santa Clara, California, USA
NVIDIA is a global leader in accelerated computing, renowned for its innovative solutions in AI and digital twins that transform diverse industries. The company specializes in networking technologies, providing end-to-end InfiniBand and Ethernet solutions for servers and storage that optimize performance and scalability. NVIDIA serves sectors such as high-performance computing, enterprise data centers, and cloud computing, constantly reinventing its products and services to stay ahead in the market.

Date Posted: 2026-06-03