
Software Engineer - Site Reliability Engineer (SRE)

Lovelace
Lovelace AI is seeking a highly skilled Site Reliability Engineer (SRE) to ensure the availability, scalability, and performance of AI-powered applications and infrastructure. The role involves bridging software development and operations, focusing on automation and engineering principles to maintain and improve systems.
Qualifications
- 5+ years of experience in site reliability engineering, DevOps, or systems administration.
- Proven track record of managing complex infrastructure and troubleshooting production issues.
- Experience with automation tools like Terraform, Ansible, or CloudFormation.
- Strong understanding of monitoring and observability solutions.
- Ability to collaborate with software engineering teams on system design.
Responsibilities
- Design, implement, and maintain monitoring, alerting, and observability solutions.
- Lead troubleshooting efforts for complex production issues and provide root cause analysis.
- Develop and maintain automation scripts and infrastructure as code using tools like Terraform and Ansible.
- Collaborate with software engineering teams to ensure new services are scalable and reliable.
- Participate in on-call rotations to respond to platform emergencies and alerts.
- Analyze system performance and recommend optimizations for scalability and efficiency.
- Implement best practices in deployment, monitoring, and incident management.
- Conduct post-incident reviews and document solutions in a knowledge base.




