Senior ML Systems Engineer, Frameworks & Tooling

Cohere, London
Full Time | python, docker, kubernetes, +4 more

The Senior ML Systems Engineer, Frameworks & Tooling role at Cohere focuses on building, maintaining, and evolving the training framework for frontier-scale language models. You'll work at the intersection of large-scale training, distributed systems, and HPC infrastructure to enable fast, reliable, and scalable model training across thousands of GPUs, and to turn research ideas into production-ready tooling and workflows.

Qualifications

  • Strong engineering experience in large-scale distributed training or HPC systems.
  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
  • Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
  • Experience working with containerized environments (Docker, Singularity/Apptainer).
  • A track record of building tools that increase developer velocity for ML teams.
  • Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
  • Strong collaboration skills — you’ll work closely with infra, research, and deployment teams.

Responsibilities

  • Build and own the training framework responsible for large-scale LLM training.
  • Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
  • Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).
  • Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
  • Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
  • Investigate and resolve performance bottlenecks across the ML systems stack.
  • Build robust systems that ensure reproducible, debuggable, large-scale runs.