About the Role:
We are seeking a highly motivated and skilled DevOps/Site Reliability Engineer (SRE) to join our team in Abu Dhabi, United Arab Emirates. The ideal candidate will have a passion for building, deploying, and maintaining scalable, reliable systems and infrastructure. You will work closely with development teams, ensuring smooth deployment pipelines, system stability, and operational efficiency.
Key Responsibilities:
Infrastructure Automation & Management
- Design, implement, and maintain CI/CD pipelines to streamline development workflows.
- Design and implement scalable infrastructure for AI model deployment and management.
- Automate infrastructure provisioning and management using tools like Terraform, Ansible, or CloudFormation.
- Optimize cloud-based and on-premises resources to improve system scalability and cost efficiency.
- Manage and optimize queuing systems and real-time streaming architectures.
System Reliability & Monitoring
- Monitor and troubleshoot production systems to maintain uptime and performance.
- Implement robust logging and alerting solutions using tools like Prometheus, Grafana, ELK stack, or similar.
- Implement comprehensive monitoring for both system metrics and ML model performance.
- Conduct root cause analyses and post-mortem reviews to improve system reliability.
Collaboration & Support
- Work with development and QA teams to integrate new features into production environments seamlessly.
- Advocate for best practices in system architecture, security, and performance optimization.
- Provide on-call support for critical production systems as part of a rotation schedule.
Security & Compliance
- Ensure infrastructure meets security and compliance requirements (e.g., SOC 2, ISO 27001).
- Manage secrets and credentials securely using tools like Vault or AWS Secrets Manager.
Required Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Strong proficiency in at least one scripting or programming language (e.g., Python, Bash, or Go).
- Hands-on experience with cloud platforms like AWS, Azure, or Google Cloud.
- Proficiency with containerization and orchestration tools (Docker, Kubernetes).
- Experience with CI/CD tools such as Azure DevOps, Jenkins, GitLab CI/CD, or CircleCI.
- Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, New Relic, or PagerDuty).
- Understanding of networking concepts (DNS, load balancing, firewalls).
- Understanding of streaming architectures for real-time AI applications.
Preferred Qualifications:
- Experience with Infrastructure as Code (IaC) tools like Terraform or Pulumi.
- Knowledge of service mesh technologies (e.g., Istio, Linkerd).
- Familiarity with database administration and scaling (VectorDBs, SQL, and NoSQL).
- Previous experience in a similar role in a high-traffic production environment.
Why Join Us?
- Opportunity to work on cutting-edge technology and challenging problems.
- Collaborative work environment that values innovation and growth.
- Competitive salary, benefits, and learning opportunities.