
Senior Cloud Support Engineer

Senior Cloud Support Engineer

Senior Cloud Support Engineer
CoreWeave
Senior Cloud Support Engineer at CoreWeave, a specialized AI-focused cloud provider delivering Kubernetes-powered HPC infrastructure for GPU workloads. The role centers on hands-on troubleshooting, incident response, customer success, and mentoring a 24/7/365 Engineering CX team to ensure high performance and reliability for AI training workloads.
Qualification
- 5+ years of experience in cloud/support engineering or a related field.
- Hands-on experience with Kubernetes-based production environments and container orchestration.
- Experience supporting GPU-enabled AI workloads and high-performance computing clusters.
- Strong troubleshooting, incident management, and root-cause analysis capabilities.
- Excellent customer-facing communication skills with proven ability to coach and mentor others.
- Experience mentoring or leading a team or junior engineers in a fast-paced environment.
- Ability to work in a 24/7 on-call shift environment with flexible scheduling.
- Proficiency in scripting/automation (e.g., Python or Bash) for diagnostics and tooling.
Responsibility
- Provide hands-on troubleshooting for GPU HPC cloud platforms powered by Kubernetes, resolving issues impacting AI training workloads and mission-critical applications.
- Lead and mentor team members across CoreWeave disciplines, helping them develop technical skills and strengthen troubleshooting capabilities.
- Deliver real-time feedback and coaching, review tickets, and identify opportunities for process and performance improvements.
- Manage incident response and participate in on-call rotations to maintain service levels and minimize downtime.
- Collaborate with data center, hardware, software engineering, and research teams to maintain platform integrity across data centers and client workloads.
- Contribute to postmortems, root-cause analyses, and knowledge-base documentation to prevent recurrence of issues.
- Optimize performance, reliability, and scalability of GPU workloads on Kubernetes-based HPC infrastructure; suggest and implement automation and tooling improvements.




