The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of the company's website or application. The SRE works closely with the development and operations teams to build and maintain a scalable, robust infrastructure that supports the company's business goals. The SRE also monitors, troubleshoots, and resolves issues as they arise, and implements automation and improvement initiatives to optimize system performance.
Responsibilities
- Maintain services by measuring and monitoring availability, latency, and overall system health.
- Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application.
- Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
- Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance.
- Conduct post-incident analyses to identify root causes and implement preventive measures so that incidents do not recur.
- Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
- Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.
- Perform capacity planning and resource allocation to ensure optimal system performance and scalability.
- Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
- Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.
- Apply a systematic problem-solving approach, communicate clearly, and demonstrate a sense of ownership and drive.
Qualifications
- A tertiary-level qualification from an internationally recognized institution
Years & Nature of Experience
- Has 3 to 5 years of equivalent experience in which the required competencies have been demonstrated
- An experienced professional who can deliver on difficult technical tasks
- Has project implementation experience
- Is self-sufficient at work and can be given responsibility for small projects
- Has provided technical supervision to junior staff in the past
Technical Competencies
- Coding languages
- Monitoring tools
- Operating systems
- Database management
Behavioural Competencies
- Problem-solving
- Communication
- Attention to detail