
Supercompute Infrastructure Engineer

Periodic Labs
Periodic Labs is an AI and physical sciences lab focused on building advanced models for scientific discoveries. The company is rapidly growing and values team members who take initiative and embrace learning. The Supercompute Infrastructure Engineer role involves leading the design, operation, and maintenance of large-scale compute clusters to support AI research, with a focus on orchestration, resource management, and automation.
Qualifications
- Experience managing large-scale compute environments or high-performance clusters.
- Familiarity with cluster scheduling and orchestration tools such as Kubernetes (k8s) and Slurm.
- Experience with cloud environments like GCP, AWS, or Azure.
- Knowledge of observability and monitoring tools such as Datadog, Prometheus, Grafana, or VictoriaMetrics.
- Proficiency in Infrastructure as Code (IaC) tools like Terraform and Ansible.
- Experience with GitOps and CI/CD tools such as GitHub Actions and Argo CD.
- Experience with clusters of 5,000 GPUs or more.
Responsibilities
- Lead the design, build, and operation of large-scale compute clusters for AI scientific research.
- Write software to orchestrate GPU and CPU clusters, manage resource allocation, and automate cluster lifecycle operations.
- Oversee the bring-up, operation, and maintenance of compute clusters.
- Develop tools to support large-scale frontier research experiments.
- Collaborate with physicists, computational materials scientists, AI researchers, and engineers to enhance research capabilities.




