Presight is seeking a meticulous and expert Lead Engineer - Site Reliability to build and support the delivery model that empowers product & technology teams to develop high-quality products, improve platform infrastructure, and strengthen the reliability of products and solutions. This role is vital in defining and establishing the delivery model used in developing cutting-edge, next-generation analytics solutions and services.
Key Responsibilities:
- Drive reliability, performance, and scalability across our infrastructure with relevant stakeholders.
- Own the SRE roadmap, guiding implementation through mentorship, code contributions, and hands-on infrastructure work.
- Partner closely with Engineering, Data Science, and Product teams to embed reliability into the development lifecycle.
- Act as the reliability architect, leading reliability strategy across services and environments.
- Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets with engineering leadership.
- Lead incident response and root cause analysis.
- Implement automation to reduce toil and improve system resilience.
- Manage capacity planning, traffic forecasting, and cost optimization.
- Mentor junior and senior Site Reliability Engineers in technical and process excellence.
- Collaborate with MLOps, DevSecOps, and CloudOps teams to enforce best practices.
- Champion observability, metrics-driven decisions, and platform maturity.
- Deploy monitoring tools such as Prometheus and Grafana to track system performance.
- Ensure that system reliability adheres to security and compliance standards, especially within regulated sectors.
- Comply with QHSE (Quality, Health, Safety, and Environment), Business Continuity, Information Security, Privacy, Risk, Compliance Management, and Governance policies and procedures.
Qualifications:
- Bachelor's degree in Computer Engineering or a related field.
- Minimum 10 years of experience in site reliability, including 2 years in people management.
- Expertise in Kubernetes, CI/CD (e.g., GitLab), and infrastructure-as-code (Terraform/Helm).
- Strong experience in cloud services (Azure, AWS, or GCP).
- Experience with multi-tenant systems or high-throughput data platforms.
- Exposure to AI/ML infrastructure or MLOps pipelines.
- Proven background in SRE principles, SLIs/SLOs, and reliability-focused engineering.
- Programming proficiency in Python or Shell (preferred).
- Deep understanding of distributed systems, networking, and incident management.
- A highly detail-oriented and methodical approach to problem solving.
- Strong analytical skills and a passion for technology, troubleshooting, and customer service.
- Excellent verbal and written communication skills.
What We Look For:
Join Presight, where we foster a culture of innovation, provide outstanding career growth opportunities, and offer competitive rewards. If you are eager to explore new frontiers in AI and thrive in a dynamic environment, we welcome you to our community.
What Working at Presight Offers:
- Culture: An open, diverse, and inclusive environment that encourages personal growth and focuses on groundbreaking, industry-first innovations.
- Career: Accelerate your career through high-impact projects and access to continuous growth and learning opportunities.
- Rewards: A competitive remuneration package with various perks, including healthcare, education support, leave benefits, and more.