Job Title
Staff Engineer, Engineering Compute Infrastructure and Grid Operations
Role Summary
Design, operate, and improve compute infrastructure that supports large-scale chip design and verification. The role focuses on grid job management, shared storage, reliability, monitoring, and operational processes to maximize engineering throughput.
This position works with engineering users, tools teams, and central IT to ensure stable, high-throughput compute and storage environments.
Experience Level
Senior — typically 8+ years in compute infrastructure, grid operations, or large-scale engineering compute environments.
Responsibilities
Key responsibilities include managing grid and storage infrastructure, improving reliability, and coordinating cross-team operational activities.
- Own and evolve grid job management for large regressions and high-volume batch workloads.
- Debug and resolve grid job failures including scheduling issues, hung jobs, resource starvation, and intermittent infrastructure faults.
- Improve job reliability using watchdogs, retries, heartbeats, timeouts, and failure detection mechanisms.
- Work with job controllers and wrapper layers to ensure consistent behavior across grid environments (e.g., LSF, UGE).
- Diagnose and resolve shared storage issues related to I/O performance, file contention, permissions, and cross-mounted filesystems.
- Identify and mitigate storage-related failure modes that cause job instability or data corruption.
- Design and deploy monitoring, logging, and metrics to detect infrastructure problems early and perform root-cause analysis of intermittent failures.
- Partner with IT and compute teams during grid/filesystem migrations, upgrades, and expansions; document procedures and runbooks.
- Lead incident response, communicate during incidents and post-mortems, and define best practices to prevent repeat incidents.
Requirements
Must-have technical skills and experience:
- 8+ years of experience in compute infrastructure, grid operations, or large-scale engineering environments.
- Strong experience with batch schedulers (LSF, UGE, Slurm, PBS).
- Hands-on experience debugging distributed systems and batch job failures.
- Strong Linux systems knowledge, including process management and resource monitoring.
- Experience with shared storage systems (NFS, enterprise filers, high-performance filesystems).
- Strong scripting skills (Python, shell, or similar).
Nice-to-have:
- Experience supporting EDA or engineering compute workloads.
- Familiarity with job-controller or wrapper-based execution architectures, and experience operating environments that run thousands of concurrent batch jobs.
- Exposure to cloud or hybrid compute environments, plus prior migration experience.
- Incident response and post-mortem leadership experience.
Education Requirements
Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent practical experience. No specific certifications are required.
About the Company
Company: Marvell Technology
Headquarters: Santa Clara, California, United States
Marvell’s semiconductor solutions serve as essential building blocks of the data infrastructure connecting our world, driving innovation across enterprise, cloud, AI, and carrier architectures. The company focuses on creating transformative technology that shapes the future.

Date Posted: 2026-04-29