The company, a leader in AI-powered cloud and digital infrastructure, is driving transformative technology solutions globally. The Lead Engineer – HPC Operations plays a critical role in ensuring the stability, performance, and scalability of the company's high-performance computing platforms that power large-scale AI and machine learning workloads. This role is responsible for overseeing day-to-day HPC operations across compute, storage, networking, and scheduling layers, while driving automation, performance optimization, and operational excellence. You will act as a senior technical authority, collaborating closely with engineering, AI platform, security, and compliance teams, and mentoring operations engineers in a highly complex, globally distributed environment.
Location: Abu Dhabi, Abu Dhabi, United Arab Emirates
Key Responsibilities
- Oversee daily operations of HPC infrastructure, including compute, GPU, storage, networking, and scheduler platforms (e.g., Slurm, Kubernetes).
- Drive continuous optimization of system performance, availability, and resource utilization, minimizing downtime and operational risk.
- Serve as the primary escalation point for L2 support teams, ensuring rapid diagnosis and resolution of complex incidents and service requests.
- Continuously monitor system health and performance using observability platforms such as Prometheus, Grafana, and DCGM to identify issues proactively.
- Manage user environments for AI and ML workloads, including container orchestration (Docker, Kubernetes) and workflow platforms such as MLflow and Kubeflow.
- Define and enforce scheduling policies, priorities, partitions, and quotas within Slurm and Kubernetes to ensure fairness, efficiency, and workload optimization.
- Lead root cause analysis (RCA) activities, produce post-mortem documentation, and implement preventive and continuous improvement actions.
- Drive automation initiatives using scripting and infrastructure-as-code tools to improve reliability, repeatability, and operational efficiency.
- Provide technical leadership, mentorship, and guidance to junior engineers; contribute to skills development and operational best practices.
- Ensure adherence to security, operational, and compliance policies; support audits and maintain documentation for change, incident, and access management processes.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- Minimum of 8 years’ experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacity.
- Advanced expertise in designing, configuring, operating, and optimizing complex HPC environments, including hardware, software, and storage systems.
- Hands-on experience managing Slurm clusters and/or Kubernetes-based platforms supporting AI/ML workloads.
- Deep knowledge of GPU resource management, workload scheduling, and performance tuning for AI and machine learning use cases.
- Strong proficiency with monitoring and observability tools such as Prometheus, Grafana, and DCGM.
- Advanced scripting and automation skills using Python, Bash, Ansible, and Terraform.
- Strong Linux administration skills (RHEL, CentOS, Ubuntu) and a solid understanding of high-speed networking (RDMA, InfiniBand, RoCE) and storage technologies (NFS, Lustre, Ceph).
Preferred Skills
- Experience operating large-scale, multi-tenant AI or research computing platforms.
- Familiarity with MLOps frameworks and production ML pipelines.
- Strong documentation, communication, and cross-functional collaboration skills.
- Experience working in regulated or sovereign cloud environments.