Job Title
HPC Engineer
Role Summary
Operate, support, and modernize on-premises and cloud high-performance computing (HPC) platforms used by engineering teams. The role focuses on job scheduling (notably IBM Spectrum LSF), production operations, automation, observability, and cloud integration to improve reliability, performance, and developer productivity.
Work with platform, cloud, storage, networking, security, and engineering teams to deliver scalable, automated HPC services and self-service tooling.
Experience Level
Mid-level. Exact years not specified; requires prior experience running production HPC or similar infrastructure, along with incident management, automation, and SRE/DevOps practices.
Responsibilities
Primary operational and engineering responsibilities for Arm's HPC platforms.
- Operate and support production HPC platforms and job schedulers (primary focus on IBM Spectrum LSF).
- Maintain service reliability, scalability, performance, and operational efficiency through automation and SRE practices.
- Develop automation, scripts, and self-service capabilities to reduce manual effort and improve user experience.
- Respond to incidents, perform operational recovery, conduct root cause analysis, and drive continuous improvement.
- Collaborate with engineering users to optimize job scheduling, workload performance, and resource utilization.
- Develop and maintain tooling using Python and Bash/shell scripting.
- Support modernization initiatives: containers, Kubernetes, cloud-native services, and Infrastructure as Code.
- Enable cloud HPC integration across AWS, GCP, Azure, OpenStack, and hybrid environments.
- Work with technical leads, architects, and project teams to deliver platform improvements and projects.
- Define and promote standards for DevOps, CI/CD, monitoring, and infrastructure automation.
Requirements
Must-have and desirable technical skills and operational experience.
Must-have:
- Experience operating HPC environments and job schedulers (IBM Spectrum LSF, Slurm, PBS, Grid Engine, or similar).
- Strong Linux system administration (preferably RHEL or RHEL-based distributions).
- Scripting and automation skills (Python, Bash/shell).
- Experience supporting production infrastructure, including incident management, operational recovery, and root cause analysis (RCA).
- Familiarity with monitoring and observability tools (e.g., Dynatrace, Prometheus, Grafana).
- Experience with CI/CD pipelines and automation frameworks.
- Experience with public, private, or hybrid cloud platforms and Kubernetes-based services.
- Understanding of DevOps, SRE, platform engineering, and infrastructure automation principles.
- Familiarity with Agile delivery and collaboration tools (Jira, Confluence).
- Ability to work directly with engineering users to translate workload requirements into operational improvements.
Nice-to-have:
- Experience in EDA or semiconductor engineering environments and familiarity with EDA tools and license-aware scheduling.
- Experience with containers, Kubernetes-native scheduling, Docker, and cloud-native orchestration.
- Infrastructure as Code experience (Terraform, Ansible).
- Exposure to alternative schedulers or cloud-native workload orchestration systems.
- Experience with AI-assisted tooling, automation agents, or large-scale distributed systems spanning on-prem and cloud.
Education Requirements
Not specified.
About the Company
Company: Arm
Headquarters: Cambridge, United Kingdom
Arm is a global leader in semiconductor and software design, driving innovation in computing technology. The company designs processors and systems that provide the essential building blocks for electronic devices. Arm's architecture is widely used in smartphones, servers, and IoT devices. Its collaborative culture fosters bold thinking and diversity, and the company offers high-impact benefits to its workforce.

Date Posted: 2026-05-11