
Software Engineer, GPU Infrastructure - HPC

Software Engineer, GPU Infrastructure - HPC
OpenAI
OpenAI is seeking a Software Engineer for its Fleet High Performance Computing (HPC) team, responsible for ensuring the reliability and uptime of the compute fleet that powers its research and products. The role involves building automation systems, monitoring server health, and collaborating with various teams to maintain high performance and efficiency in supercomputing infrastructure.
Qualification
- Experience managing large-scale server environments.
- Proficiency in Python, Go, or similar languages.
- Strong Linux, networking, and server hardware knowledge.
- Comfort digging into noisy data with SQL, PromQL, and Pandas or any other tool.
- A balance of strengths in building and operationalizing.
Responsibility
- Build and maintain automation systems for provisioning and managing server fleets.
- Develop tools to monitor server health, performance, and lifecycle events.
- Collaborate with clusters, networking, and infrastructure teams.
- Partner with external operators to ensure a high level of quality.
- Identify and fix performance bottlenecks and inefficiencies.
- Continuously improve automation to reduce manual work.




