Member of Technical Staff - Reliability Engineering

Modal, New York
Full-time | USD 150,000 – 350,000 per year | Tags: python, aws, docker (+14 more)

About Us

Modal provides the infrastructure foundation for AI teams. With instant GPU access, sub-second container startups, and native storage, Modal makes it simple to train models, run batch jobs, and serve low-latency inference. We have thousands of customers who rely on us for production AI workloads, including Lovable, Scale AI, Substack, and Suno.

We're a fast-growing team based in NYC, SF, and Stockholm. We've hit 9-figure ARR and recently raised a Series B (https://modal.com/blog/announcing-our-series-b) at a $1.1B valuation. Our investors include Lux Capital (https://www.luxcapital.com/), Redpoint Ventures (https://www.redpoint.com/), Amplify Partners (https://www.amplifypartners.com/), and Elad Gil (https://eladgil.com/).

Working at Modal means joining one of the fastest-growing AI infrastructure organizations at an early stage, with many opportunities to grow within the company. Our team includes creators of popular open-source projects (e.g. Seaborn, https://github.com/mwaskom/seaborn, and Luigi, https://github.com/spotify/luigi), academic researchers, international olympiad medalists, and engineering and product leaders with decades of experience.

The Role

  • Identify architectural changes to improve reliability, performance, and availability.
  • Foster a culture of reliability across Modal's engineering organization.
  • Design and implement key operational processes such as deployments, upgrades, rollbacks, and postmortem review.
  • Join a core engineering team and participate in the on-call rotation, responding to production incidents.
  • Build monitoring systems that ensure the highest quality service for our customers.
  • Debug production issues across all services and levels of the stack.

Requirements

  • 5+ years of experience writing high-quality production code.
  • 2+ years of on-call experience for critical production services.
  • Strong cloud skills and deep familiarity with at least one hyperscaler cloud (AWS preferred).
  • Familiarity with auto scaling, fleet management, and capacity planning at scale.
  • Experience owning and scaling Kubernetes clusters to thousands of nodes is a plus.
  • Experience with systems safety research (e.g. STAMP) and control theory is a plus.
  • Ability to work in-person in our NYC, SF, or Stockholm offices.
