Principal Site Reliability Engineer

Abu Dhabi, United Arab Emirates. Posted: 30 Apr 2026

Financial

  • Estimate: $95k - $120k*
  • Zero income tax location

Accessibility

  • Office Only
  • Visa Provided

Requirements

  • Experience: Senior
  • English: Professional

Position

The company, a leader in AI-powered cloud and digital infrastructure, is seeking a Principal Site Reliability Engineer to architect and lead the evolution of our globally distributed infrastructure supporting AI and private cloud workloads. This high-impact technical leadership role focuses on building scalable, resilient, and self-healing platforms through advanced automation and AIOps.

Key Responsibilities:

  • Platform Architecture & Strategy:

    • Define and lead the long-term roadmap for infrastructure, CI/CD, and Kubernetes platforms
    • Design scalable, distributed systems aligned with AI/ML and HPC workloads
    • Establish standards for infrastructure-as-code and platform engineering
  • Automation & AIOps:

    • Design and implement AI-driven automation and self-healing systems
    • Develop autonomous workflows for incident remediation and capacity optimization
    • Evolve observability into predictive AIOps capabilities
  • Kubernetes & Infrastructure Engineering:

    • Architect high-performance Kubernetes environments for multi-tenancy and GPU-intensive workloads
    • Optimize infrastructure for performance, scalability, and cost efficiency
    • Support advanced scheduling and orchestration frameworks for AI workloads
  • Observability & Reliability:

    • Build and enhance observability platforms integrating metrics, logs, and tracing
    • Define SLOs/SLIs aligned with business outcomes
    • Lead root cause analysis (RCA) and promote reliability best practices including error budgets
  • Leadership & Technical Excellence:

    • Act as the escalation point for complex system issues
    • Mentor and develop SRE and DevOps teams, driving a culture of excellence
    • Lead architectural reviews and contribute to internal Centers of Excellence
  • Cross-Functional Collaboration:

    • Partner with product and engineering teams to balance innovation with reliability
    • Translate technical challenges into business impact for senior stakeholders
    • Influence infrastructure and platform strategy across the organization

Required Qualifications & Experience:

  • 10+ years of experience in Site Reliability Engineering, Platform Engineering, or Systems Architecture
  • Proven experience designing and operating large-scale distributed systems
  • Deep expertise in Kubernetes environments (EKS, GKE, or bare metal), including GPU workloads
  • Strong programming skills in Python, Go, or Rust
  • Extensive experience with Terraform, Helm, and infrastructure-as-code practices
  • Strong understanding of observability systems (metrics, logging, tracing)

Preferred Qualifications:

  • Experience with AI/ML infrastructure, including model serving and data pipelines
  • Familiarity with scheduling frameworks (e.g., Ray, Kueue, Volcano)
  • Experience building automation or AI-driven operational tools
  • Certifications such as CKA or AWS/Azure Solutions Architect
  • Experience influencing technical strategy across large organizations

What We’re Looking For:

A highly experienced and forward-thinking engineer with deep technical expertise and a passion for building resilient, scalable systems. You should be a strong problem solver, an influential leader, and a strategic thinker who can drive innovation while maintaining operational excellence.

Benefits:

  • Competitive Salary: Attractive salary package based on skills and experience
  • Yearly Bonus: Performance-based annual bonus
  • Exclusive Discount Cards: Access to special benefits with Esaad and Fazaa cards
  • Premium Family Insurance: Comprehensive health coverage for you and your family
  • Learning & Development: Access to top-tier learning platforms for career growth

This role promotes an inclusive, innovative, and collaborative work environment, grounded in values such as trust, accountability, and high performance.
