Operations Engineer, Fleet Reliability

CoreWeave • Livingston, NJ / New York, NY / Plano, TX / Sunnyvale, CA / Bellevue, WA / Richmond, VA

Full Time • on-site • full-time • devops • operations • ai

The Operations Engineer, Fleet Reliability role at CoreWeave focuses on the provisioning, management, and uptime of a large GPU-accelerated HPC fleet. The role sits at the intersection of the hardware, software, data center, and platform teams, driving rapid deployment of nodes and maintaining cloud health for AI workloads.

Qualifications

Curious, creative, and persistent problem-solving mindset
Experience provisioning and validating large-scale server/node deployments
Strong ability to troubleshoot hardware and software issues across multiple layers
Experience coordinating with data center, network, hardware, and platform teams
Ability to monitor system performance and implement remediation actions
Proven ability to create and maintain clear documentation of processes and best practices
Willingness to adapt to shifting priorities in a fast-paced environment
Team-oriented with a focus on rapid deployment and reliability

Responsibilities

Configure and maintain large-scale high-performance computing clusters running state-of-the-art GPUs
Provision and validate server nodes; deploy nodes as quickly as they can be racked and turned on
Troubleshoot hardware and software issues; escalate and coordinate with data center, network, hardware and platform teams to drive resolution
Monitor and analyze system performance; take remediation actions to maintain cloud health
Create and maintain documentation of team processes, knowledge, and best practices for system management
Collaborate with data center, network, hardware and platform teams to support uptime and reliability
Adapt to shifting business and technical priorities with flexibility and optimism
