Research Engineer, Pretraining Scaling

Anthropic · San Francisco, CA
Full Time · machine-learning · ai · python · +5 more


Anthropic is seeking a Research Engineer for its ML Performance and Scaling team, which trains Anthropic's production pretrained models and keeps those training runs reliable and efficient. The role blends research and engineering, requiring deep technical expertise in large-scale ML systems and a passion for the field.

Qualifications

  • Hands-on experience training large language models
  • Deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems
  • Enjoyment of both research and engineering work, ideally in a roughly 50/50 split
  • Strong problem-solving skills and ability to work under pressure during model launches
  • Experience with performance optimization and observability in ML systems

Responsibilities

  • Own critical aspects of the production pretraining pipeline, including model operations, performance optimization, observability, and reliability
  • Debug and resolve complex issues across the full stack—from hardware errors and networking to training dynamics and evaluation infrastructure
  • Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
  • Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams
  • Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
  • Add new capabilities to the training codebase, such as long context support or novel architectures
  • Collaborate closely with teammates across SF and London, as well as with Tokens, Architectures, and Systems teams
  • Contribute to the team's institutional knowledge by documenting systems, debugging approaches, and lessons learned