Responsibilities:
- Leading the SRE team, setting objectives, and guiding the team towards achieving high reliability while balancing cost and performance SLAs.
- Collaborating with platform & product engineering teams to embed reliability and operational best practices into the software development lifecycle.
- Developing and implementing SRE policies and practices, including service level objectives (SLOs), service level indicators (SLIs), and error budgets.
- Driving automation across operations to reduce toil, improve system performance, and ensure scalability, with a healthy aversion to repetitive manual work.
- Overseeing incident management, post-mortem analyses, and root cause investigations to prevent future outages and enhance system reliability.
- Facilitating capacity planning and scalability exercises to manage growth and ensure the efficient use of resources.
- Facilitating disaster recovery plans & testing to ensure business continuity for our customers’ webstores.
- Encouraging a culture of continuous improvement by mentoring team members and fostering innovation within the team.
- Staying up to date with the latest trends and technologies in SRE and advocating for their adoption where appropriate.
Requirements:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- At least 5 years of experience in Site Reliability Engineering, with 2+ years in a leadership or management role.
- Proven expertise in cloud computing platforms (e.g., AWS, Azure, GCP) and experience with container orchestration (e.g., Kubernetes).
- A deep understanding of network protocols, load balancing, and high availability configurations.
- Experience applying software development practices to SRE problems, and familiarity with programming languages: preferably PowerShell and C#, otherwise Python, Go, Java, etc.
- Experience with automation tools and infrastructure as code (e.g., Terraform, Ansible).
- Proficiency in monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack) and in implementing comprehensive monitoring solutions. Dynatrace knowledge is a plus.
- Excellent problem-solving skills, with a proven ability to tackle complex issues under pressure.
- Outstanding leadership qualities, with a track record of mentoring and developing high-performing teams.
- Exceptional communication and collaboration skills, capable of working effectively with cross-functional teams.
Benefits:
- The opportunity to make an impact at a fast-growing SaaS scale-up.
- Up to 3 weeks “work from anywhere” per year.
- A global and customized onboarding program (rated 9.1/10 by previous hires).
- A hybrid working model – 3 days from the office, 2 days from home.
- Biweekly company lunch on us (Wednesdays).
For more information about the role, please get in touch with one of our recruiters. Apply online for the fastest feedback on your application.
Contact: Jonathan Salamanca
Phone: +31 10 243 6010