Site Reliability Engineer

Abu Dhabi, United Arab Emirates · Posted: 02 Aug 2025

Financial

Estimate: $70k - $90k*
Zero income tax location

Accessibility

Office Only
No Relocation Support
Visa Provided

Requirements

Experience: Senior
English: Professional

Position

The Site Reliability Engineer (SRE) will lead an SRE squad focused on enhancing service reliability, performance, and scalability. The role involves driving automation to reduce toil, improving system uptime, and managing incident resolution. Responsibilities include building monitoring systems, optimizing infrastructure, implementing safe deployment practices, ensuring alignment with SLAs/SLOs, and contributing to system development and code reviews.

Key Responsibilities:

Team Leadership & Reporting: Lead the SRE squad, represent the team in senior management briefings, and produce dashboards and progress reports.
Toil Reduction & Automation: Identify and eliminate repetitive tasks through automation to enhance efficiency and service reliability.
Service Reliability & Uptime: Maintain and improve service availability in line with SLAs/SLOs, design failover strategies, and harden systems.
Performance & Latency Optimization: Use profiling tools, distributed tracing, load testing, and bottleneck analysis to enhance performance and reduce latency.
Change & Deployment Management: Implement safe deployment practices such as canary releases and blue-green deployments, ensuring minimal risk and rapid rollback options.
Monitoring & Observability: Build and manage real-time monitoring and alerting systems to ensure service health and proactively detect anomalies.
Incident Management & RCA: Lead incident resolution efforts, conduct root cause analyses (RCA), and develop response playbooks to reduce MTTR.
Capacity & Cost Optimization: Perform infrastructure capacity planning and cost-efficient scaling to meet service demands.
Development & Code Review: Contribute to system development, participate in design/code reviews, and ensure alignment with engineering best practices.
Governance, Compliance & Documentation: Enforce IT governance standards, maintain documentation, perform quality assessments, and contribute to architecture and risk committees.

Education & Experience:

7+ years of experience with data structures/algorithms and software development in two or more programming languages; 3+ years of experience in a DevOps or SRE role.
Experience in computing, distributed systems, storage, or networking.
Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
Ability to debug, optimize code, and automate routine tasks.
Strong communication skills to articulate technical issues in terms of business risk and opportunity.
Knowledge of cloud computing, data centers, networks, and virtual infrastructure.

Work Conditions: