Senior Engineer - HPC Operations

Abu Dhabi, United Arab Emirates. Posted: 14 Jan 2026

Financial

Estimate: $80k - $120k*
Zero income tax location

Accessibility

Office Only
Apply from abroad
Visa Provided

Requirements

Experience: Senior
English: Professional

Position

We are seeking a highly skilled Senior Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads. This role ensures stable, secure, and high-performing infrastructure leveraging technologies such as Slurm, Kubernetes, and modern MLOps platforms. The ideal candidate will bring deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments.

Responsibilities

Provide daily operational support for HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes).
Drive initiatives to optimize the efficiency and performance of HPC systems, maximizing resource utilization and minimizing downtime.
Ensure the timely and effective resolution of incidents and service requests, maintaining system reliability and uptime.
Continuously monitor system health, performance, and utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM).
Manage and support user environments for AI/ML workloads, including containers and container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).
Define, implement, and manage job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure fairness, efficiency, and workload optimization.
Conduct root cause analysis (RCA) of operational issues, contributing to post-mortem documentation and driving continuous improvement initiatives.
Provide mentorship and guidance to junior engineers, fostering skills development and a collaborative environment.
Participate in on-call rotation as needed.
Ensure compliance with security and operational policies, assisting with audits and maintaining documentation for change and incident management processes.

Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
Minimum of 5 years of experience in HPC operations, systems engineering, or DevOps roles.
Advanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.
Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
In-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.
Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.
Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform.
Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph).

Location
Abu Dhabi, Abu Dhabi, United Arab Emirates.
