Lead Engineer - HPC Operations

Abu Dhabi, United Arab Emirates. Posted: 22 May 2026

Financial

  • Estimate: $90k - $120k*
  • Zero income tax location

Accessibility

  • Office Only
  • Apply from abroad
  • Visa Provided

Requirements

  • Experience: Senior
  • English: Professional

Position

The company is a leader in AI-powered cloud and digital infrastructure. Using advanced resources and partnerships, it helps clients adopt sovereign AI infrastructure, particularly in sectors with stringent regulatory requirements. It combines sovereign capabilities with scalable, high-performance compute infrastructure, positioning itself at the forefront of AI innovation in the Middle East and beyond.

We are seeking a highly skilled Lead Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads. This role ensures stable, secure, and high-performing infrastructure by utilizing technologies such as Slurm, Kubernetes, and modern MLOps platforms. The ideal candidate will bring deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments.

Key Responsibilities:

  • Oversee the daily operational management of HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes).
  • Drive efforts to optimize the efficiency and performance of HPC systems, ensuring maximum resource utilization and minimizing downtime.
  • Serve as the primary technical escalation point for L2 support teams, ensuring rapid and effective resolution of incidents and service requests.
  • Continuously monitor system health, performance, and resource utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM).
  • Manage user environments for AI/ML workloads, including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).
  • Define and enforce job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments.
  • Lead root cause analysis (RCA) of operational issues and drive continuous improvement initiatives.
  • Provide mentorship and technical guidance to junior engineers, fostering skills development and knowledge sharing.
  • Participate in the on-call rotation as necessary.
  • Ensure adherence to security and operational policies, assisting in audits and maintaining documentation for change and incident management processes.

Required Skills / Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • Minimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacity.
  • Advanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.
  • Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
  • In-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.
  • Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.
  • Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform.
  • Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph).

Language Requirements: English (professional level).
