Software Engineer, Training & Inference Infrastructure

DatologyAI, Redwood City

DatologyAI is seeking a Software Engineer to build and maintain large-scale training and inference infrastructure for machine learning. The role involves collaborating with researchers and product engineers to optimize performance and reliability of ML systems. The company is based in Redwood City, CA, and operates in-office four days a week.

Qualifications

  • At least 5 years of professional software engineering experience.
  • Expertise in Python and experience with deep learning frameworks (PyTorch preferred).
  • Understanding of modern ML architectures and optimization for performance, particularly for training and/or inference.
  • Familiarity with inference tooling like vLLM, SGLang, or custom solutions.
  • Experience with cloud environments and resource orchestration.

Responsibilities

  • Architect and maintain reliable, scalable, and cost-efficient training infrastructure.
  • Build robust model serving infrastructure for low-latency, high-throughput inference across heterogeneous hardware.
  • Automate resource orchestration and fault recovery across GPUs, networking, OS, drivers, and cloud environments.
  • Partner with researchers to productionize new models and features quickly and safely.
  • Optimize training and inference pipelines for performance, reliability, and cost.
  • Ensure all infrastructure meets high standards for reliability, security, and observability.
