Senior Cloud Support Engineer

CoreWeave
Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
Full Time | kubernetes, docker, python, +5 more

CoreWeave, a specialized AI-focused cloud provider delivering Kubernetes-powered HPC infrastructure for GPU workloads, is hiring a Senior Cloud Support Engineer. The role centers on hands-on troubleshooting, incident response, customer success, and mentoring a 24/7/365 Engineering CX team to ensure high performance and reliability for AI training workloads.

Qualifications

  • 5+ years of experience in cloud/support engineering or a related field.
  • Hands-on experience with Kubernetes-based production environments and container orchestration.
  • Experience supporting GPU-enabled AI workloads and high-performance computing clusters.
  • Strong troubleshooting, incident management, and root-cause analysis capabilities.
  • Excellent customer-facing communication skills with proven ability to coach and mentor others.
  • Experience mentoring or leading a team or junior engineers in a fast-paced environment.
  • Ability to work in a 24/7 on-call shift environment with flexible scheduling.
  • Proficiency in scripting/automation (e.g., Python or Bash) for diagnostics and tooling.

Responsibilities

  • Provide hands-on troubleshooting for GPU HPC cloud platforms powered by Kubernetes, resolving issues impacting AI training workloads and mission-critical applications.
  • Lead and mentor team members across CoreWeave disciplines, helping them develop technical skills and strengthen troubleshooting capabilities.
  • Deliver real-time feedback and coaching, review tickets, and identify opportunities for process and performance improvements.
  • Manage incident response and participate in on-call rotations to maintain service levels and minimize downtime.
  • Collaborate with data center, hardware, software engineering, and research teams to maintain platform integrity across data centers and client workloads.
  • Contribute to postmortems, root-cause analyses, and knowledge-base documentation to prevent recurrence of issues.
  • Optimize performance, reliability, and scalability of GPU workloads on Kubernetes-based HPC infrastructure; suggest and implement automation and tooling improvements.
