HPC Engineer

Riyadh, Saudi Arabia · Posted: 10 Jun 2026

Financial

Estimate: $60k - $80k*
Zero income tax location

Accessibility

Office Only
Apply from abroad
Visa Provided

Requirements

Experience: Senior
English: Professional

Position

About the Job:
The company is a technology service provider and a leading outsourced partner specializing in delivering professional and managed solutions across EMEA. We are seeking an experienced Senior Infrastructure HPC Engineer who has a proven track record in designing, deploying, configuring, and operating components of a large-scale high-performance computing environment.

Key Responsibilities:

Design, deploy, and maintain HPC clusters end-to-end, including compute nodes, storage tiers, high-speed networking (InfiniBand / RoCE), and management fabric.
Provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster imaging, OS lifecycle, and GPU fleet health monitoring.
Deploy and manage the full NVIDIA AI Enterprise Suite, including installation, licensing, updates, and integration with MLOps pipelines (NeMo, Triton, RAPIDS).
Operate the NVIDIA GPU Operator and Network Operator on Kubernetes to automate driver and CUDA lifecycle management, DCGM exporter deployment, and MIG configuration.
Configure and serve NVIDIA NIM inference endpoints and implement NVIDIA Blueprint reference architectures for production AI workloads.
Install, administer, and tune Slurm, including partitions, QOS, fair-share policies, node accounting, and MPI integration.
Bootstrap and operate Kubernetes clusters with kubeadm, ensuring control plane HA, etcd backup, and zero-downtime upgrades.
Administer RHEL / Canonical Ubuntu across all cluster nodes.
Build and maintain CI/CD pipelines (GitLab CI / GitHub Actions) for infrastructure provisioning and HPC software delivery.
Profile and tune GPU and CPU workload performance while resolving bottlenecks across hardware, drivers, MPI fabric, and application layers.
Implement cluster monitoring using Prometheus, Grafana, and DCGM, defining alerting and capacity planning thresholds.
Enforce security best practices including node hardening, kernel patching, RBAC, and compliance audits across the HPC environment.