Site Reliability Engineer
1 week ago
A Site Reliability Engineer (SRE) is responsible for ensuring that a company's systems, services, and infrastructure are reliable, scalable, and efficient. The role is a hybrid between software engineering and operations, with an emphasis on improving the reliability and performance of services through automation, monitoring, and proactive issue resolution.
SREs work to ensure that applications and systems are available and performant, typically using a combination of software engineering practices, system administration, and deep monitoring of system health. They also create systems to reduce manual intervention and automate processes to increase efficiency and uptime.
Key Responsibilities
1. System Reliability and Performance
- Monitoring and Incident Management: Set up and maintain monitoring tools (e.g., Prometheus, Grafana, Datadog) to track system performance, uptime, and error rates. Quickly identify issues and mitigate service outages by responding to incidents.
- Service-Level Objectives (SLOs): Define and manage Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs) to measure and maintain system reliability, ensuring that services meet business and customer expectations.
- Incident Response: Respond to production incidents, troubleshoot issues, and minimize downtime. After incidents, perform post-mortem analyses to identify root causes and prevent recurrence.
- Capacity Planning: Ensure the systems are capable of scaling with the growing load, handling spikes in demand, and maintaining performance during high-traffic periods. Plan for scaling resources based on traffic projections and historical usage patterns.
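As an illustration of how the SLOs and SLIs mentioned above relate in practice, the sketch below computes an availability SLI and the remaining error budget from request counts. The function names, the request-based definition of availability, and the 99.9% target are assumptions chosen for the example; they are not taken from this posting.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 1.0 - failed_requests / total_requests


def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the availability objective as a fraction, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: one million requests, 250 of them failed.
    print(availability_sli(1_000_000, 250))
    print(error_budget_remaining(0.999, 1_000_000, 250))
```

A negative result from error_budget_remaining means the service has already failed more requests than the objective allows for the period, which teams commonly treat as a signal to prioritize reliability work over new releases.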
2. Automation and Infrastructure as Code (IaC)
- Automation of Repetitive Tasks: Write scripts and create automation tools to replace manual processes, such as deployments, monitoring, and scaling. This may involve using tools like Ansible, Terraform, or Kubernetes.
- Infrastructure Management: Implement and manage infrastructure as code (IaC) practices to provision, configure, and manage cloud infrastructure (e.g., AWS, GCP, Azure) and on-premises resources, using tools like Terraform, CloudFormation, or Kubernetes.
- Continuous Integration and Continuous Delivery (CI/CD): Build and maintain CI/CD pipelines to automate software deployments, ensuring that changes are automatically tested, validated, and pushed to production.
3. Reliability and System Optimization
- Root Cause Analysis: After an incident, conduct a thorough post-mortem and root cause analysis to understand why failures occurred and how to prevent them in the future. Share findings with stakeholders and implement corrective actions.
- Performance Tuning: Continuously optimize the performance of services by tuning servers, databases, networking, and application code to reduce latency and increase throughput.
- Disaster Recovery Planning: Design, implement, and test disaster recovery strategies to ensure that systems can quickly recover from major failures or outages.
4. Collaboration and Communication
- Cross-Functional Collaboration: Work closely with development teams to integrate reliability and performance into the development lifecycle. Provide feedback to developers on how to improve the reliability and operability of their services.
- Documentation: Write and maintain clear documentation of SRE practices, incident response procedures, system configurations, and infrastructure as code (IaC) guidelines to ensure the reliability processes are well understood by the broader team.
- Change Management: Participate in change management processes, ensuring that changes to production environments are well planned and minimize risk to system availability.
5. Security and Compliance
- Security Best Practices: Implement security practices in system design and operations, ensuring that the systems are protected against vulnerabilities and threats. Monitor for potential security incidents and address them proactively.
- Compliance: Ensure that the systems comply with relevant regulatory requirements (e.g., GDPR, HIPAA) by incorporating compliance controls and audits into operations.
6. Cost Management
- Cost Optimization: Monitor cloud and infrastructure costs, recommending cost-effective solutions while balancing performance and scalability. Implement best practices to reduce unnecessary costs related to resources and services.
1. Technical Skills
- Programming and Scripting: Proficiency in programming languages (e.g., Python, Go, Ruby, Java, or Bash) to automate tasks, build tools, and analyze systems.
- Cloud Platforms: Expertise with cloud computing platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure) and related services such as load balancing, storage, and virtual machines.
- Infrastructure Automation: Familiarity with Infrastructure as Code (IaC) tools such as Terraform, Ansible, Puppet, or Chef to manage infrastructure resources.
- Containers and Orchestration: Experience with containerization (e.g., Docker) and container orchestration systems (e.g., Kubernetes) to manage deployments and scale services.
- Monitoring Tools: Experience using monitoring and alerting tools like Prometheus, Grafana, Datadog, or New Relic to ensure system reliability and performance.
- CI/CD Pipelines: Knowledge of building and maintaining continuous integration/continuous deployment pipelines using tools such as Jenkins, GitLab CI, CircleCI, or Travis CI.
2. Problem-Solving and Troubleshooting
- Incident Management: Expertise in diagnosing and troubleshooting complex issues in production systems, from applications to infrastructure, often under time pressure.
- Root Cause Analysis: Strong problem-solving skills to conduct root cause analysis and determine long-term solutions to systemic problems.
- Performance Tuning: Ability to analyze system performance, identify bottlenecks, and implement improvements to increase efficiency.
3. Communication Skills
- Cross-Functional Collaboration: Strong communication skills to collaborate effectively with software development, product, and operations teams to build reliable systems.
- Documentation: Ability to write detailed and clear documentation for both technical and non-technical stakeholders.
4. Soft Skills
- Attention to Detail: Precision in tracking and managing various system components, ensuring that no detail is overlooked.
- Time Management: Ability to manage multiple priorities, incidents, and projects, ensuring timely responses and task completion.
- Resilience and Calm Under Pressure: Ability to remain calm and focused during high-stress situations, especially during incidents or outages.
Experience and Qualifications
1. Experience
- Relevant Experience: Typically 3-5 years of experience in system administration, DevOps, software engineering, or infrastructure management, with a focus on reliability and performance.
- Incident Management: Experience in handling high-impact incidents, including troubleshooting, mitigating, and conducting post-mortem analyses.
Engineering, Reliability, Systems Engineering
Employment Type: Full-time
Department / Functional Area: Engineering
Experience: years
Gender: Male
Vacancy: 1