Distributed Training Engineer

Periodic Labs | Menlo Park, Remote
Apply Now

Periodic Labs is an AI and physical sciences lab focused on building advanced models for scientific discovery. The Distributed Training Engineer role involves optimizing and developing large-scale distributed LLM training systems, collaborating with researchers, and contributing to open-source training frameworks.

Qualifications

  • Experience with training on clusters with ≥5,000 GPUs
  • Knowledge of 5D parallel LLM training
  • Familiarity with distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, TorchTitan
  • Ability to optimize training throughput for large-scale Mixture-of-Experts models
  • Experience in AI and computational sciences

Responsibilities

  • Optimize and operate large-scale distributed LLM training systems
  • Develop and maintain mid-training and reinforcement learning workflows
  • Collaborate with researchers to debug and support training processes
  • Build tools for frontier-scale experiments
  • Contribute to open-source large-scale LLM training frameworks
