Operations Engineering Manager, Fleet Reliability

CoreWeave
Livingston, NJ / New York, NY / Richmond, VA

CoreWeave is seeking an Operations Engineering Manager for its Fleet Reliability Operations team, which oversees the provisioning, updating, and maintenance of server nodes. The role involves leading a 24/7 team focused on reliability and automation while maintaining high customer satisfaction and driving process improvement as the company significantly scales its fleet.

Qualifications

  • Experience in operations management or engineering management roles.
  • Strong understanding of server infrastructure and reliability engineering principles.
  • Proven ability to lead and develop high-performing teams.
  • Experience with process improvement methodologies and automation tools.
  • Excellent communication and documentation skills.

Responsibilities

  • Build and lead a 24/7 team of process-oriented, reliability- and observability-focused engineers.
  • Lead the socialization and documentation of clear and consistent processes for provisioning, validating, and troubleshooting nodes in our server fleet.
  • Advocate for process and automation improvements, prioritizing event-driven automated remediation.
  • Provide a 24/7 engineering support function for high-criticality, time-sensitive node delivery and maintenance.
  • Drive onboarding, documentation, enablement, and performance management for team members.