

Senior+ Site Reliability Engineer

Crusoe
Senior+ Site Reliability Engineer focused on Operational Excellence at Crusoe, which is building a reliable, energy-efficient, AI-optimized GPU cloud. Drive incident response, reliability practices, and automation to reduce toil and improve resilience and disaster recovery across a large-scale distributed platform.
Qualifications
- 5+ years of experience in cloud operations, SRE, or related roles.
- Strong understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS and/or GCP, virtualization, distributed systems).
- Familiarity with incident management practices and operational frameworks (e.g., SRE, ITIL).
- Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn.
- Familiarity with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible).
- Experience with incident response, RCA documentation, and post-incident reviews.
- Experience with large-scale distributed systems and disaster recovery planning.
- Strong collaboration and communication skills to partner with cross-functional teams.
Responsibilities
- Define, refine, and track availability metrics with SLIs/SLOs for Crusoe's cloud infrastructure.
- Assist in incident response by identifying, diagnosing, and resolving service disruptions; contribute to post-incident RCAs and reviews.
- Build, operate, and monitor infrastructure health using the observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry).
- Identify reliability risks, performance bottlenecks, and early indicators of potential incidents; communicate findings to cross-functional teams.
- Develop automation and tooling to reduce operational toil and strengthen self-healing and service recovery capabilities.
- Collaborate with compute, network, storage, and platform teams to improve service resilience and disaster recovery readiness.
- Contribute to knowledge sharing, process improvements, and operational best practices across the organization.
- Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities.




