Senior+ Site Reliability Engineer

Crusoe, San Francisco, CA - US
Full-time | aws, gcp, kubernetes, +4 more

Senior+ Site Reliability Engineer focused on Operational Excellence at Crusoe, which is building a reliable, energy-efficient, AI-optimized GPU cloud. The role drives incident response, reliability practices, and automation to reduce toil and improve resilience and disaster recovery across a large-scale distributed platform.

Qualifications

  • 5+ years of experience in cloud operations, SRE, or related roles.
  • Strong understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS and/or GCP, virtualization, distributed systems).
  • Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.).
  • Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn.
  • Familiarity with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible).
  • Experience with incident response, RCA documentation, and post-incident reviews.
  • Experience with large-scale distributed systems and disaster recovery planning.
  • Strong collaboration and communication skills to partner with cross-functional teams.

Responsibilities

  • Define, refine, and track availability metrics with SLIs/SLOs for Crusoe's cloud infrastructure.
  • Assist in incident response by identifying, diagnosing, and resolving service disruptions; contribute to post-incident RCAs and reviews.
  • Build, operate, and monitor infrastructure health using the observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry).
  • Identify reliability risks, performance bottlenecks, and early indicators of potential incidents; communicate findings to cross-functional teams.
  • Develop automation and tooling to reduce operational toil and strengthen self-healing and service recovery capabilities.
  • Collaborate with compute, network, storage, and platform teams to improve service resilience and disaster recovery readiness.
  • Contribute to knowledge sharing, process improvements, and operational best practices across the organization.
  • Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities.
