
Software Engineer, Site Reliability (SRE)

Sierra
Sierra is seeking a Software Engineer for its Site Reliability team to improve the reliability, observability, and scalability of its AI-driven infrastructure. The role involves collaborating with engineering and product teams to keep systems efficient and available, and leading improvements to deployment and incident management processes. The company values trust, customer obsession, craftsmanship, intensity, and family. It is primarily based in San Francisco, with offices in multiple global cities.
Qualifications
- 5+ years of hands-on experience in Site Reliability or Infrastructure engineering roles.
- Experience designing for availability, scalability, and reliability at infrastructure and application layers.
- Deep experience with Terraform, AWS services, and container orchestration.
- Strong background in observability systems like Prometheus, Grafana, or Datadog.
- Experience working with enterprise customers and understanding their compliance and networking needs.
Responsibilities
- Own Sierra’s observability stack—monitoring, alerting, logging, and tracing.
- Partner with product and platform engineers to design reliable and scalable systems.
- Design and implement scalable, reliable, and secure cloud infrastructure using AWS and Terraform.
- Improve the reliability and scalability of LLM deployments.
- Lead improvements to deployment pipelines, CI/CD tooling, and incident management processes.
- Define the foundation of SRE practices at Sierra, influencing culture and tooling.




