
Software Engineer, Collective Communication
OpenAI
The Software Engineer, Collective Communication role at OpenAI involves designing and implementing custom networking collectives for OpenAI's training stack in C++ and CUDA. The position is part of the Workload Networking team, which focuses on improving collective communication techniques so that AI models train efficiently on supercomputers. The role is hybrid, based in San Francisco, CA, and offers relocation assistance.
Qualifications
- Experience with C++ and CUDA programming.
- Background in low-level, performance-critical software development.
- Familiarity with collective communication techniques is a bonus.
- Experience writing distributed algorithms using RDMA.
- Comfort writing low-level, performance-sensitive CPU and/or GPU code.
- Familiarity with network simulation techniques.
Responsibilities
- Collaborate closely with ML researchers to design and implement efficient collective operations in C++ and CUDA.
- Ensure that our largest training jobs take full advantage of the different network transports used in our supercomputers.
- Work on simulations to inform our future supercomputer network designs.
- Design and implement custom networking collectives tightly integrated into the training stack.
- Optimize low-level, performance-critical software for collective communication.




