Site Reliability Engineer

1 week ago


Dubai, Dubai, United Arab Emirates Canonical Full time
Roles and responsibilities
  • A Site Reliability Engineer (SRE) is responsible for ensuring that a company's systems, services, and infrastructure are reliable, scalable, and efficient. The role is a hybrid between software engineering and operations, with an emphasis on improving the reliability and performance of services through automation, monitoring, and proactive issue resolution.

    SREs work to ensure that applications and systems are available and performant, typically using a combination of software engineering practices, system administration, and deep monitoring of system health. They also create systems to reduce manual intervention and automate processes to increase efficiency and uptime.

    Key Responsibilities

    1. System Reliability and Performance

  • Monitoring and Incident Management: Set up and maintain monitoring tools (e.g., Prometheus, Grafana, Datadog) to track system performance, uptime, and error rates. Quickly identify issues and mitigate service outages by responding to incidents.
  • Service-Level Objectives (SLOs): Define and manage Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs) to measure and maintain system reliability, ensuring that services meet business and customer expectations.
  • Incident Response: Respond to production incidents, troubleshoot issues, and minimize downtime. After incidents, perform post-mortem analyses to identify root causes and prevent recurrence.
  • Capacity Planning: Ensure that systems can scale with growing load, handle spikes in demand, and maintain performance during high-traffic periods. Plan for scaling resources based on traffic projections and historical usage patterns.
    2. Automation and Infrastructure as Code (IaC)

  • Automation of Repetitive Tasks: Write scripts and create automation tools to replace manual processes, such as deployments, monitoring, and scaling. This may involve using tools like Ansible, Terraform, or Kubernetes.
  • Infrastructure Management: Implement and manage infrastructure as code (IaC) practices to provision, configure, and manage cloud infrastructure (e.g., AWS, GCP, Azure) and on-premises resources, using tools like Terraform, CloudFormation, or Kubernetes.
  • Continuous Integration and Continuous Delivery (CI/CD): Build and maintain CI/CD pipelines to automate software deployments, ensuring that changes are automatically tested, validated, and pushed to production.
    3. Reliability and System Optimization

  • Root Cause Analysis: After an incident, conduct a thorough post-mortem and root cause analysis to understand why failures occurred and how to prevent them in the future. Share findings with stakeholders and implement corrective actions.
  • Performance Tuning: Continuously optimize the performance of services by tuning servers, databases, networking, and application code to reduce latency and increase throughput.
  • Disaster Recovery Planning: Design, implement, and test disaster recovery strategies to ensure that systems can quickly recover from major failures or outages.
    4. Collaboration and Communication

  • Cross-Functional Collaboration: Work closely with development teams to integrate reliability and performance into the development lifecycle. Provide feedback to developers on how to improve the reliability and operability of their services.
  • Documentation: Write and maintain clear documentation of SRE practices, incident response procedures, system configurations, and infrastructure as code (IaC) guidelines to ensure that reliability processes are well understood by the broader team.
  • Change Management: Participate in change management processes, ensuring that changes to production environments are well planned and minimize risk to system availability.
    5. Security and Compliance

  • Security Best Practices: Implement security practices in system design and operations, ensuring that systems are protected against vulnerabilities and threats. Monitor for potential security incidents and address them proactively.
  • Compliance: Ensure that systems comply with relevant regulatory requirements (e.g., GDPR, HIPAA) by incorporating compliance controls and audits into operations.
    6. Cost Management

  • Cost Optimization: Monitor cloud and infrastructure costs, recommending cost-effective solutions while balancing performance and scalability. Implement best practices to reduce unnecessary costs related to resources and services.
Desired candidate profile

1. Technical Skills

  • Programming and Scripting: Proficiency in programming languages (e.g., Python, Go, Ruby, Java, or Bash) to automate tasks, build tools, and analyze systems.
  • Cloud Platforms: Expertise with cloud computing platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure) and related services such as load balancing, storage, and virtual machines.
  • Infrastructure Automation: Familiarity with Infrastructure as Code (IaC) tools such as Terraform, Ansible, Puppet, or Chef to manage infrastructure resources.
  • Containers and Orchestration: Experience with containerization (e.g., Docker) and container orchestration systems (e.g., Kubernetes) to manage deployments and scale services.
  • Monitoring Tools: Experience using monitoring and alerting tools like Prometheus, Grafana, Datadog, or New Relic to ensure system reliability and performance.
  • CI/CD Pipelines: Knowledge of building and maintaining continuous integration/continuous deployment pipelines using tools such as Jenkins, GitLab CI, CircleCI, or Travis CI.

2. Problem-Solving and Troubleshooting

  • Incident Management: Expertise in diagnosing and troubleshooting complex issues in production systems, from applications to infrastructure, often under time pressure.
  • Root Cause Analysis: Strong problem-solving skills to conduct root cause analysis and determine long-term solutions to systemic problems.
  • Performance Tuning: Ability to analyze system performance, identify bottlenecks, and implement improvements to increase efficiency.

3. Communication Skills

  • Cross-Functional Collaboration: Strong communication skills to collaborate effectively with software development, product, and operations teams to build reliable systems.
  • Documentation: Ability to write detailed and clear documentation for both technical and non-technical stakeholders.

4. Soft Skills

  • Attention to Detail: Precision in tracking and managing various system components, ensuring that no detail is overlooked.
  • Time Management: Ability to manage multiple priorities, incidents, and projects, ensuring timely responses and task completion.
  • Resilience and Calm Under Pressure: Ability to remain calm and focused during high-stress situations, especially during incidents or outages.

Experience and Qualifications

1. Experience

  • Relevant Experience: Typically 3-5 years of experience in system administration, DevOps, software engineering, or infrastructure management, with a focus on reliability and performance.
  • Incident Management: Experience in handling high-impact incidents, including troubleshooting, mitigating, and conducting post-mortem analyses.
Key Skills
Engineering, Reliability, Systems Engineering
Employment Type: Full-time
Department / Functional Area: Engineering
Experience: years
Gender: Male
Vacancy: 1
