Site Reliability Engineer (SRE)

Riyadh, Saudi Arabia
Posted: 02 Jul 2026

Financial

Estimate: $90k - $120k*
Zero income tax location

Accessibility

Office Only
Apply from abroad
Visa Provided

Requirements

Experience: Intermediate
English: Professional

Position

About the role
We are hiring an SRE focused on observability, automation, and runtime reliability for AI platforms and internal agentic systems. This is not a generic SOC role. It is an engineering role for someone who builds telemetry, automates findings-to-fix loops, improves production readiness, and keeps AI systems measurable, resilient, and controllable in production.

Location
Riyadh, Saudi Arabia

Responsibilities

Design and operate the telemetry and observability layer for AI platforms, including audit trails, tool-call logs, correlation IDs, traces, and runtime visibility across service boundaries.
Build automated findings-to-fix loops for AI and cloud platforms, integrating signals from tooling such as Wiz, Astrix, or future AI security products into pragmatic remediation workflows.
Implement reliability and hardening controls for internal AI systems, including alerting, health checks, rollback drills, kill-switch validation, rate limiting, and drift detection.
Codify detections, policies, and operational checks as code where they reduce toil, prevent regressions, and improve platform control.
Review platform and AI-application changes from a reliability and application-hardening perspective, especially around secrets, telemetry, external calls, risky MCP usage, and production readiness.
Own AI-platform-specific operational readiness and partner with central IT / EAS / SOC teams for escalations, handoffs, and shared incident workflows when needed.
Continuously improve production readiness through automation, post-incident learning, and repeatable playbooks for AI runtime issues.

Requirements

3+ years in SRE, production engineering, platform operations, or security automation with strong coding ability.
Hands-on scripting and coding experience, especially Python, with comfort working against APIs, log pipelines, and automation workflows.
Experience building pragmatic observability and alerting systems in AWS or comparable cloud environments.
Ability to reduce operational toil through automation while keeping signal quality high and false positives manageable.
Comfortable with incident handling, rollback planning, SLA / SLO discussions, and evidence-driven postmortems.
Interest in AI systems, agent runtimes, and MCP-style integration risks is highly valued.

Nice to have

Software engineering background beyond scripting, including code review and testing habits.
Experience with AI agent runtimes, prompt / tool telemetry, or internal platform hardening for LLM-powered systems.
Experience with privacy-aware telemetry, compliance-oriented logging, or runtime protection products.
