Staff Engineer, Engineering Compute Infrastructure and Grid Operations

Marvell Technology
April 30, 2026
Full-time
On-site
Westborough, Massachusetts, United States
$128,000 - $189,370 USD yearly

Role Summary

Design, operate, and improve compute infrastructure that supports large-scale chip design and verification. The role focuses on grid job management, shared storage, reliability, monitoring, and operational processes to maximize engineering throughput.

This position works with engineering users, tools teams, and central IT to ensure stable, high-throughput compute and storage environments.

Experience Level

Senior: typically 8+ years of experience in compute infrastructure, grid operations, or large-scale engineering compute environments.

Responsibilities

Key responsibilities include managing grid and storage infrastructure, improving reliability, and coordinating cross-team operational activities.

  • Own and evolve grid job management for large regressions and high-volume batch workloads.
  • Debug and resolve grid job failures including scheduling issues, hung jobs, resource starvation, and intermittent infrastructure faults.
  • Improve job reliability using watchdogs, retries, heartbeats, timeouts, and failure detection mechanisms.
  • Work with job controllers and wrapper layers to ensure consistent behavior across grid environments (e.g., LSF, UGE).
  • Diagnose and resolve shared storage issues related to I/O performance, file contention, permissions, and cross-mounted filesystems.
  • Identify and mitigate storage-related failure modes that cause job instability or data corruption.
  • Design and deploy monitoring, logging, and metrics to detect infrastructure problems early and perform root-cause analysis of intermittent failures.
  • Partner with IT and compute teams during grid/filesystem migrations, upgrades, and expansions; document procedures and runbooks.
  • Lead incident response, communicate during incidents and post-mortems, and define best practices to prevent repeat incidents.

Requirements

Must-have technical skills and experience:

  • 8+ years of experience in compute infrastructure, grid operations, or large-scale engineering environments.
  • Strong experience with batch schedulers (LSF, UGE, Slurm, PBS).
  • Hands-on experience debugging distributed systems and batch job failures.
  • Strong Linux systems knowledge, including process management and resource monitoring.
  • Experience with shared storage systems (NFS, enterprise filers, high-performance filesystems).
  • Strong scripting skills (Python, shell, or similar).

Nice-to-have:

  • Experience supporting EDA or engineering compute workloads.
  • Familiarity with job-controller or wrapper-based execution architectures, and experience operating environments with thousands of concurrent batch jobs.
  • Exposure to cloud or hybrid compute environments and prior migration experience.
  • Incident response and post-mortem leadership experience.

Education Requirements

Bachelor’s degree in Computer Science, Computer Engineering, or Electrical Engineering, or equivalent practical experience. No specific certifications are required.


About the Company

Company: Marvell Technology

Headquarters: Santa Clara, California, United States

Marvell’s semiconductor solutions serve as essential building blocks of the data infrastructure connecting our world, driving innovation across enterprise, cloud, AI, and carrier architectures. The company focuses on creating transformative technology that shapes the future.

Date Posted: 2026-04-29