
Senior ML Systems Engineer, Frameworks & Tooling

Cohere
As a Senior ML Systems Engineer, Frameworks & Tooling at Cohere, you will build, maintain, and evolve the training framework for frontier-scale language models. You'll work at the intersection of large-scale training, distributed systems, and HPC infrastructure to enable fast, reliable, and scalable model training across thousands of GPUs, turning research ideas into production-ready tooling and workflows.
Qualifications
- Strong engineering experience in large-scale distributed training or HPC systems.
- Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
- Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
- Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
- Experience working with containerized environments (Docker, Singularity/Apptainer).
- A track record of building tools that increase developer velocity for ML teams.
- Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
- Strong collaboration skills: you’ll work closely with infra, research, and deployment teams.
Responsibilities
- Build and own the training framework responsible for large-scale LLM training.
- Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
- Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).
- Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
- Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
- Investigate and resolve performance bottlenecks across the ML systems stack.
- Build robust systems that ensure reproducible, debuggable, large-scale runs.




