Senior+ Site Reliability Engineer

Crusoe • San Francisco, CA - US

Full Time • aws gcp kubernetes terraform ansible ci-cd devops

Senior+ Site Reliability Engineer focused on Operational Excellence at Crusoe, building a reliable, energy-efficient AI-optimized GPU cloud. Drive incident response, reliability practices, and automation to reduce toil and improve resilience and disaster recovery across a large-scale distributed platform.

Qualifications

5+ years of experience in cloud operations, SRE, or related roles.
Strong understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS and/or GCP, virtualization, distributed systems).
Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.).
Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn.
Familiarity with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible).
Experience with incident response, RCA documentation, and post-incident reviews.
Experience with large-scale distributed systems and disaster recovery planning.
Strong collaboration and communication skills to partner with cross-functional teams.

Responsibilities

Define, refine, and track availability metrics with SLIs/SLOs for Crusoe's cloud infrastructure.
Assist in incident response by identifying, diagnosing, and resolving service disruptions; contribute to post-incident RCAs and reviews.
Build, operate, and monitor infrastructure health using the observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry).
Identify reliability risks, performance bottlenecks, and early indicators of potential incidents; communicate findings to cross-functional teams.
Develop automation and tooling to reduce operational toil and strengthen self-healing and service recovery capabilities.
Collaborate with compute, network, storage, and platform teams to improve service resilience and disaster recovery readiness.
Contribute to knowledge sharing, process improvements, and operational best practices across the organization.
Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities.
