
Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

Crusoe
Crusoe is seeking a Senior Software Engineer to lead the development of its observability platform, focusing on scalable systems that enhance the reliability and performance of its cloud infrastructure. The role involves designing telemetry pipelines, integrating monitoring tools, and ensuring security best practices while mentoring other engineers.
Qualifications
- 7+ years of experience in infrastructure or platform engineering with a focus on observability
- Deep expertise with metrics systems (Prometheus, Thanos, Cortex) and logging pipelines (Fluent Bit, ELK)
- Strong programming skills in Go or Python for automation and custom integrations
- Experience running observability platforms on Kubernetes at scale
- Proven ability to design and optimize telemetry pipelines for high-throughput data
- Solid understanding of distributed systems and cloud infrastructure
Responsibilities
- Designing and operating scalable observability systems across multi-datacenter Kubernetes environments
- Architecting end-to-end telemetry pipelines for ingestion, storage, querying, and visualization
- Extending monitoring and alerting with tools like Prometheus, Grafana, and OpenTelemetry
- Building scalable log collection and processing pipelines using Fluent Bit and ELK/OpenSearch stacks
- Implementing distributed tracing platforms and integrating with service meshes and APIs
- Defining and driving adoption of SLOs, SLIs, and error budgets across services
- Automating provisioning and scaling of observability infrastructure with Kubernetes and Terraform
- Embedding security best practices into observability platforms



