

Operations Engineer, Fleet Reliability
CoreWeave
The Operations Engineer, Fleet Reliability at CoreWeave focuses on the provisioning, management, and uptime of a large GPU-accelerated HPC fleet. The role sits at the intersection of the hardware, software, data center, and platform teams, driving rapid deployment of nodes and ensuring cloud health for AI workloads.
Qualifications
- Curious, creative, and persistent problem-solving mindset
- Experience provisioning and validating large-scale server/node deployments
- Strong ability to troubleshoot hardware and software issues across multiple layers
- Experience coordinating with data center, network, hardware, and platform teams
- Ability to monitor system performance and implement remediation actions
- Proven ability to create and maintain clear documentation of processes and best practices
- Willingness to adapt to shifting priorities in a fast-paced environment
- Team-oriented with a focus on rapid deployment and reliability
Responsibilities
- Configure and maintain large-scale high-performance computing clusters running state-of-the-art GPUs
- Provision and validate server nodes; deploy nodes as quickly as they can be racked and turned on
- Troubleshoot hardware and software issues; escalate to and coordinate with data center, network, hardware, and platform teams to drive resolution
- Monitor and analyze system performance; take remediation actions to maintain cloud health
- Create and maintain documentation of team processes, knowledge, and best practices for system management
- Collaborate with data center, network, hardware, and platform teams to support uptime and reliability
- Adapt to shifting business and technical priorities with flexibility and optimism




