
Software Engineer, Frontier Systems

Software Engineer, Frontier Systems
OpenAI
The Frontier Systems team at OpenAI is responsible for building and maintaining the largest supercomputers used for advanced AI model training. The Software Engineer role focuses on developing critical infrastructure to ensure the reliability and efficiency of these systems, minimizing disruptions during large-scale training runs.
Qualification
- 7+ years of industry experience in software engineering
- Proficiency in Python and shell scripting
- Comfort with SQL, PromQL, and Pandas for data analysis
- Experience in developing reproducible analyses
- Ability to balance building and operationalizing solutions
Responsibility
- Own and improve system health checks for hyperscale supercomputers
- Lead investigations into hardware failures and system-level bugs
- Build automation to monitor and resolve issues across thousands of machines
- Ensure uninterrupted operations for researchers during model training
- Develop reproducible analyses and operationalize solutions




