
Software Engineer, Reliability

OpenAI
The Software Engineer, Reliability role at OpenAI involves ensuring the scalability, performance, and reliability of systems that deliver AI technology to users. The position requires collaboration across various teams to build resilient infrastructure and implement testing and automation tools, while maintaining a focus on safety and responsible AI deployment.
Qualifications
- Experience in reliability engineering and in equipping teams with effective tools.
- Strong problem-solving skills and a collaborative mindset.
- Ability to take ownership of problems and learn necessary skills to resolve them.
- Experience in identifying and addressing performance bottlenecks.
- Familiarity with Infrastructure as Code (IaC) practices.
Responsibilities
- Design and implement solutions for infrastructure scalability to meet increasing demands.
- Build and maintain load, chaos, and synthetic testing software for reliability.
- Develop automation tools to streamline tasks and enhance system reliability.
- Manage the lifecycle of CPU, storage, GPU, and network resources to optimize their use.
- Implement fault-tolerant design patterns to minimize service disruptions.
- Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) for system reliability.
- Collaborate with cross-functional teams to introduce new features and capabilities.
- Participate in on-call rotation for critical incident response.



