Senior Engineer - HPC Operations

Abu Dhabi, United Arab Emirates · Posted: 22 May 2026

Financial

Estimate: $80k - $120k*
Zero income tax location

Accessibility

Office Only
Apply from abroad
Visa Provided

Requirements

Experience: Senior
English: Professional

Position

The company, a leader in AI-powered cloud and digital infrastructure, is seeking a highly skilled Senior Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads. This role will ensure a stable, secure, and high-performing infrastructure leveraging technologies such as Slurm, Kubernetes, and modern MLOps platforms.

The ideal candidate will possess deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments. Responsibilities include collaborating with multidisciplinary teams, leading complex projects, implementing cutting-edge technologies, and providing mentorship to operations engineers.

Key Responsibilities:

Lead the daily operational support of HPC infrastructure including compute, storage, networking, and scheduler components (Slurm, Kubernetes, etc.).
Maximize the efficiency and performance of HPC systems, ensuring optimal resource utilization and minimal downtime.
Act as the primary technical escalation point for L2 support teams and ensure prompt resolution of incidents and service requests.
Monitor system health, performance, and utilization using advanced tools (e.g., Prometheus, Grafana, DCGM).
Manage user environments for AI/ML workloads, including containers and container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).
Implement and manage job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments.
Lead root cause analysis (RCA) of operational issues and contribute to post-mortem documentation and continuous improvement efforts.
Provide mentorship and guidance to junior engineers and participate in the on-call rotation if required.
Ensure compliance with security and operational policies; assist in audits and documentation for change and incident management processes.

Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, or related technical field.
7+ years of experience in HPC operations, systems engineering, or DevOps roles.
Advanced knowledge and expertise in configuring, optimizing, and maintaining complex HPC environments.
Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
Expert knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.
Experience with monitoring frameworks such as Prometheus, Grafana, and DCGM.
Strong scripting and automation skills (Python, Bash, Ansible, Terraform).
In-depth understanding of Linux (RHEL/CentOS/Ubuntu), networking concepts, and storage technologies.

What We Offer:

Competitive salary based on skills and experience.
Yearly performance-based bonus.
Exclusive discount cards for services.
Comprehensive health coverage, including dental, vision, and life insurance for employees and their families.
Access to premium learning platforms for career development.

Location: Abu Dhabi, Abu Dhabi Emirate, United Arab Emirates
Work Conditions: On-site, Full-time
