Operations Engineer, Fleet Reliability

CoreWeave
Livingston, NJ / New York, NY / Plano, TX / Sunnyvale, CA / Bellevue, WA / Richmond, VA
Full Time / on-site / devops

The Operations Engineer, Fleet Reliability at CoreWeave focuses on the provisioning, management, and uptime of a large GPU-accelerated HPC fleet. The role sits at the intersection of the hardware, software, data center, and platform teams, driving rapid deployment of nodes and maintaining cloud health for AI workloads.

Qualifications

  • A curious, creative, and persistent problem-solving mindset
  • Experience provisioning and validating large-scale server/node deployments
  • Strong ability to troubleshoot hardware and software issues across multiple layers
  • Experience coordinating with data center, network, hardware, and platform teams
  • Ability to monitor system performance and implement remediation actions
  • Proven ability to create and maintain clear documentation of processes and best practices
  • Willingness to adapt to shifting priorities in a fast-paced environment
  • Team-oriented with a focus on rapid deployment and reliability

Responsibilities

  • Configure and maintain large-scale high-performance computing clusters running state-of-the-art GPUs
  • Provision and validate server nodes; deploy nodes as quickly as they can be racked and turned on
  • Troubleshoot hardware and software issues; escalate to and coordinate with data center, network, hardware, and platform teams to drive resolution
  • Monitor and analyze system performance; take remediation actions to maintain cloud health
  • Create and maintain documentation of team processes, knowledge, and best practices for system management
  • Collaborate with data center, network, hardware, and platform teams to support uptime and reliability
  • Adapt to shifting business and technical priorities with flexibility and optimism
