

Software Engineer, Infrastructure

Exa
Exa is seeking an Infrastructure Engineer to join its Infrastructure team, which builds the foundational tooling and infrastructure behind Exa's AI-driven search engine. The role involves designing and operating large-scale systems, with a focus on GPU clusters and Kubernetes orchestration, to improve performance and reliability.
Qualifications
- Experience with large-scale infrastructure design and operation.
- Familiarity with GPU clusters and Kubernetes.
- Knowledge of cloud batch job systems.
- Strong focus on reliability, observability, and optimization.
- Ability to work in a fast-paced engineering environment.
Responsibilities
- Design and operate large-scale infrastructure, including GPU clusters and Kubernetes clusters.
- Build GPU cluster orchestration using Kubernetes.
- Scale AWS batch job systems to manage map-reduce jobs across thousands of machines.
- Develop GPU scheduling software to optimize cluster utilization.
- Implement observability tooling for production systems.



