
Research Engineer, Pretraining Scaling

Anthropic
Anthropic is seeking a Research Engineer for its ML Performance and Scaling team, which trains Anthropic's production pretrained models and keeps those runs reliable and efficient. The role blends research and engineering, and calls for deep technical expertise in large-scale ML systems and a passion for the field.
Qualifications
- Hands-on experience training large language models
- Deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems
- Enjoyment of both research and engineering work, ideally in roughly a 50/50 split
- Strong problem-solving skills and ability to work under pressure during model launches
- Experience with performance optimization and observability in ML systems
Responsibilities
- Own critical aspects of the production pretraining pipeline, including model operations, performance optimization, observability, and reliability
- Debug and resolve complex issues across the full stack, from hardware errors and networking to training dynamics and evaluation infrastructure
- Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
- Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams
- Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
- Add new capabilities to the training codebase, such as long context support or novel architectures
- Collaborate closely with teammates across SF and London, as well as with Tokens, Architectures, and Systems teams
- Contribute to the team's institutional knowledge by documenting systems, debugging approaches, and lessons learned
