Site Reliability Engineer

7 days ago


Dubai, Dubai, United Arab Emirates Canonical Full time
Roles and responsibilities
  • A Site Reliability Engineer (SRE) is responsible for ensuring that a company's systems, services, and infrastructure are reliable, scalable, and efficient. The role is a hybrid between software engineering and operations, with an emphasis on improving the reliability and performance of services through automation, monitoring, and proactive issue resolution.

    SREs work to ensure that applications and systems are available and performant, typically using a combination of software engineering practices, system administration, and deep monitoring of system health. They also create systems to reduce manual intervention and automate processes to increase efficiency and uptime.

    Key Responsibilities

    1. System Reliability and Performance

  • Monitoring and Incident Management: Set up and maintain monitoring tools (e.g., Prometheus, Grafana, Datadog) to track system performance, uptime, and error rates. Quickly identify issues and mitigate service outages by responding to incidents.
  • Service-Level Objectives (SLOs): Define and manage Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs) to measure and maintain system reliability, ensuring that services meet business and customer expectations.
  • Incident Response: Respond to production incidents, troubleshoot issues, and minimize downtime. After incidents, perform post-mortem analyses to identify root causes and prevent recurrence.
  • Capacity Planning: Ensure systems can scale with growing load, absorb spikes in demand, and maintain performance during high-traffic periods. Plan resource scaling based on traffic projections and historical usage patterns.
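As an illustration of the SLO and SLI work described above, the following is a minimal sketch of how an availability SLI and the remaining error budget might be computed. It assumes a request-based SLI (successful requests divided by total requests). The function names, the 99.9% target, and the request counts are hypothetical and are not taken from any tooling used in this role.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent in the measurement window.

    A negative result means the SLO has been violated for this window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    sli = availability_sli(1_000_000, 250)
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"SLI: {sli:.5f}, error budget remaining: {remaining:.1%}")
```

In practice, the request counts would come from a monitoring system such as those named above rather than being passed in by hand.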

    2. Automation and Infrastructure as Code (IaC)

  • Automation of Repetitive Tasks: Write scripts and create automation tools to replace manual processes, such as deployments, monitoring, and scaling. This may involve using tools like Ansible, Terraform, or Kubernetes.
  • Infrastructure Management: Apply infrastructure as code (IaC) practices to provision, configure, and manage cloud infrastructure (e.g., AWS, GCP, Azure) and on-premises resources, using tools like Terraform, CloudFormation, or Kubernetes.
  • Continuous Integration and Continuous Delivery (CI/CD): Build and maintain CI/CD pipelines to automate software deployments, ensuring that changes are automatically tested, validated, and pushed to production.

    3. Reliability and System Optimization

  • Root Cause Analysis: After an incident, conduct a thorough post-mortem and root cause analysis to understand why failures occurred and how to prevent them in the future. Share findings with stakeholders and implement corrective actions.
  • Performance Tuning: Continuously optimize the performance of services by tuning servers, databases, networking, and application code to reduce latency and increase throughput.
  • Disaster Recovery Planning: Design, implement, and test disaster recovery strategies to ensure that systems can quickly recover from major failures or outages.

    4. Collaboration and Communication

  • Cross-Functional Collaboration: Work closely with development teams to integrate reliability and performance into the development lifecycle. Provide feedback to developers on how to improve the reliability and operability of their services.
  • Documentation: Write and maintain clear documentation of SRE practices, incident response procedures, system configurations, and infrastructure as code (IaC) guidelines so that reliability processes are well understood by the broader team.
  • Change Management: Participate in change management processes, ensuring that changes to production environments are well-planned and minimize risk to system availability.

    5. Security and Compliance

  • Security Best Practices: Implement security practices in system design and operations, ensuring that the systems are protected against vulnerabilities and threats. Monitor for potential security incidents and address them proactively.
  • Compliance: Ensure that the systems comply with relevant regulatory requirements (e.g., GDPR, HIPAA) by incorporating compliance controls and audits into operations.

    6. Cost Management

  • Cost Optimization: Monitor cloud and infrastructure costs, recommending cost-effective solutions while balancing performance and scalability. Implement best practices to reduce unnecessary costs related to resources and services.
Desired candidate profile

1. Technical Skills

  • Programming and Scripting: Proficiency in programming and scripting languages (e.g., Python, Go, Ruby, Java, Bash) to automate tasks, build tools, and analyze systems.
  • Cloud Platforms: Expertise with cloud computing platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure) and related services such as load balancing, storage, and virtual machines.
  • Infrastructure Automation: Familiarity with Infrastructure as Code (IaC) tools such as Terraform, Ansible, Puppet, or Chef to manage infrastructure resources.
  • Containers and Orchestration: Experience with containerization (e.g., Docker) and container orchestration systems (e.g., Kubernetes) to manage deployments and scale services.
  • Monitoring Tools: Experience using monitoring and alerting tools like Prometheus, Grafana, Datadog, or New Relic to ensure system reliability and performance.
  • CI/CD Pipelines: Knowledge of building and maintaining continuous integration and continuous delivery (CI/CD) pipelines using tools such as Jenkins, GitLab CI, CircleCI, or Travis CI.

2. Problem-Solving and Troubleshooting

  • Incident Management: Expertise in diagnosing and troubleshooting complex issues in production systems, from applications to infrastructure, often under time pressure.
  • Root Cause Analysis: Strong problem-solving skills to conduct root cause analysis and determine long-term solutions to systemic problems.
  • Performance Tuning: Ability to analyze system performance, identify bottlenecks, and implement improvements to increase efficiency.

3. Communication Skills

  • Cross-Functional Collaboration: Strong communication skills to collaborate effectively with software development, product, and operations teams to build reliable systems.
  • Documentation: Ability to write detailed and clear documentation for both technical and non-technical stakeholders.

4. Soft Skills

  • Attention to Detail: Precision in tracking and managing various system components, ensuring that no detail is overlooked.
  • Time Management: Ability to manage multiple priorities, incidents, and projects, ensuring timely responses and task completion.
  • Resilience and Calm Under Pressure: Ability to remain calm and focused during high-stress situations, especially during incidents or outages.

Experience and Qualifications

1. Experience

  • Relevant Experience: Typically 3-5 years of experience in system administration, DevOps, software engineering, or infrastructure management, with a focus on reliability and performance.
  • Incident Management: Experience in handling high-impact incidents, including troubleshooting, mitigating, and conducting post-mortem analyses.

