
Distributed Training Engineer

Periodic Labs
Periodic Labs is an AI and physical sciences lab focused on building advanced models for scientific discovery. The Distributed Training Engineer develops and optimizes large-scale distributed LLM training systems, collaborates with researchers, and contributes to open-source training frameworks.
Qualifications
- Experience with training on clusters with ≥5,000 GPUs
- Knowledge of 5D parallel LLM training
- Familiarity with distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, TorchTitan
- Ability to optimize training throughput for large-scale Mixture-of-Experts models
- Experience in AI and computational sciences
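To make the 5D-parallelism qualification concrete, here is a minimal sketch of how a cluster's GPU count factorizes across the five parallel dimensions. It assumes "5D" refers to data, tensor, pipeline, context, and expert parallelism (the axes exposed by frameworks such as Megatron-LM); the specific degrees shown are hypothetical, and real frameworks constrain how expert parallelism composes with data parallelism.

```python
# Sketch: factorizing a GPU cluster into five parallel dimensions.
# Assumption: "5D" = data x tensor x pipeline x context x expert parallelism;
# the example layout below is illustrative, not a recommended configuration.

def world_size(dp: int, tp: int, pp: int, cp: int, ep: int) -> int:
    """Total GPUs required: the product of all five parallel degrees
    (a simplification; some frameworks fold expert parallelism into
    the data-parallel dimension rather than multiplying it in)."""
    return dp * tp * pp * cp * ep

# A hypothetical layout crossing the >=5,000-GPU threshold:
# 16-way data, 8-way tensor, 8-way pipeline, 4-way context, 2-way expert.
layout = {"dp": 16, "tp": 8, "pp": 8, "cp": 4, "ep": 2}
print(world_size(**layout))  # 8192 GPUs
```

The point of the exercise: at this scale, choosing the split among the five degrees (rather than any single degree) is what determines communication volume and pipeline bubble overhead.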
Responsibilities
- Optimize and operate large-scale distributed LLM training systems
- Develop and maintain mid-training and reinforcement learning workflows
- Collaborate with researchers to debug and support training processes
- Build tools for frontier-scale experiments
- Contribute to open-source large-scale LLM training frameworks
