OpenAI logo

Software Engineer, GPU Infrastructure - HPC

OpenAISan Francisco
FullTimepythongolinux+5 more
Apply Now
OpenAI logo

Software Engineer, GPU Infrastructure - HPC

OpenAI

Apply Now

OpenAI is seeking a Software Engineer for its Fleet High Performance Computing (HPC) team, responsible for ensuring the reliability and uptime of the compute fleet that powers its research and products. The role involves building automation systems, monitoring server health, and collaborating with various teams to maintain high performance and efficiency in supercomputing infrastructure.

Qualification

  • Experience managing large-scale server environments.
  • Proficiency in Python, Go, or similar languages.
  • Strong Linux, networking, and server hardware knowledge.
  • Comfort digging into noisy data with SQL, PromQL, and Pandas or any other tool.
  • A balance of strengths in building and operationalizing.

Responsibility

  • Build and maintain automation systems for provisioning and managing server fleets.
  • Develop tools to monitor server health, performance, and lifecycle events.
  • Collaborate with clusters, networking, and infrastructure teams.
  • Partner with external operators to ensure a high level of quality.
  • Identify and fix performance bottlenecks and inefficiencies.
  • Continuously improve automation to reduce manual work.

Similar Jobs