Lead Engineer - HPC Operations

Abu Dhabi, United Arab Emirates
Posted: 25 Jun 2026

Financial

Estimate: $90k - $120k*
Zero income tax location

Accessibility

Office Only
Apply from abroad
Visa Provided

Requirements

Experience: Senior
English: Professional

Position

The company, a leader in AI-powered cloud and digital infrastructure, is seeking a highly skilled Lead Engineer – HPC Operations to oversee the daily operations and support of the high-performance computing clusters that power its large-scale AI and ML workloads. The role is responsible for keeping this infrastructure stable, secure, and high-performing, using technologies such as Slurm, Kubernetes, and modern MLOps platforms.

The ideal candidate will have deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments. Responsibilities include collaborating with multidisciplinary teams, leading complex projects, implementing cutting-edge technologies, and mentoring operations engineers.

Responsibilities:

Oversee daily operational management of HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes).
Optimize efficiency and performance of HPC systems, ensuring maximum resource utilization and minimizing downtime.
Serve as the primary technical escalation point for L2 support teams.
Monitor system health, performance, and resource utilization with advanced tools (e.g., Prometheus, Grafana).
Manage user environments for AI/ML workloads.
Define and enforce job scheduling policies within Slurm and/or Kubernetes environments.
Conduct root cause analysis (RCA) of operational issues.
Mentor junior engineers and foster skills development.

Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
Minimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, including at least 2 years in a leadership capacity.
Advanced expertise in configuring, optimizing, and maintaining complex HPC environments.
Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.
Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform.
Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies, and storage solutions.

What working at the company offers:

Competitive salary based on skills and experience.
Performance-based yearly bonus.
Access to exclusive discount cards providing benefits across various services.
Comprehensive health coverage, including dental, vision, and life insurance for you and your family.
Access to top-tier learning platforms for career growth.
