

Member of Technical Staff - Reliability Engineering
Modal
About Us
Modal provides the infrastructure foundation for AI teams. With instant GPU access, sub-second container startups, and native storage, Modal makes it simple to train models, run batch jobs, and serve low-latency inference. We have thousands of customers who rely on us for production AI workloads, including Lovable, Scale AI, Substack, and Suno.
We're a fast-growing team based out of NYC, SF, and Stockholm. We've hit 9-figure ARR and recently raised a Series B https://modal.com/blog/announcing-our-series-b at a $1.1B valuation. Our investors include Lux Capital https://www.luxcapital.com/, Redpoint Ventures https://www.redpoint.com/, Amplify Partners https://www.amplifypartners.com/, and Elad Gil https://eladgil.com/.
Working at Modal means joining one of the fastest-growing AI infrastructure organizations at an early stage, with many opportunities to grow within the company. Our team includes creators of popular open-source projects (e.g. Seaborn https://github.com/mwaskom/seaborn, Luigi https://github.com/spotify/luigi), academic researchers, international olympiad medalists, and engineering and product leaders with decades of experience.
The Role
- Identify architectural changes to improve reliability, performance, and availability.
- Foster a culture of reliability across Modal's engineering organization.
- Design and implement key operational processes such as deployments, upgrades, rollbacks, and postmortem review.
- Join a core engineering team and participate in the on-call rotation, responding to production incidents.
- Build monitoring systems that ensure the highest quality service for our customers.
- Debug production issues across all services and levels of the stack.
Requirements
- 5+ years of experience writing high-quality production code.
- 2+ years of on-call experience for critical production services.
- Strong cloud skills and deep familiarity with at least one hyperscaler cloud (AWS preferred).
- Familiarity with auto scaling, fleet management, and capacity planning at scale.
- Experience owning and scaling Kubernetes clusters to thousands of nodes is a plus.
- Experience with systems safety research (e.g. STAMP) and control theory is a plus.
- Ability to work in person in our NYC, SF, or Stockholm offices.



