Site Reliability Engineering Lead
5 days ago
SRE Leader JD
What you'll do
Strategy and Governance
- Formulate and implement the company-level reliability strategy and SLO/error budget mechanism, and establish a reliability measurement system centered on business impact.
- Establish release and change governance (access control, canary, rollback, freeze windows), and drive the quantification and standardization of change risk.
- Establish a unified incident response system (SEV classification, IM/IC command mechanism, internal and external communication), and promote blameless postmortems and systematic improvement.
Team and Organization
- Build and manage the SRE team (platform SRE, domain SRE, NOC), clarify role responsibilities and on-call rotation mechanisms, and build an engineering culture and talent pipeline.
- Collaborate across teams, working with R&D, architecture, DBA, network, security, and legal/compliance to get reliability goals included in roadmaps and KPIs.
Platform and Engineering Implementation
- Observability platform: Unify metric/log/tracing standards and SDKs, and build high-availability data pipelines, alert noise-reduction systems, emergency response platforms, and ticketing systems.
- Delivery platform: CI/CD, GitOps, feature flags, progressive release, policy checks, and image signing to improve release quality and frequency.
- Capacity and Performance Engineering: Benchmarking and stress testing, capacity forecasting and elasticity, hotspot isolation and degradation strategies, ensuring controlled degradation under extreme market conditions.
- Disaster recovery and business continuity: Multi-AZ/multi-region architecture, RTO/RPO objectives and drills, data backup/recovery, and reconciliation consistency guarantees.
- Chaos Engineering and Fault Drills: Support core businesses in conducting fault drills, and build a chaos engineering platform to identify and promptly address potential system risks.
Exchange-Specific Scenarios
- Matching and low latency: End-to-end latency SLIs, matching confirmation and replay, sequence number consistency and idempotency, isolation of hot trading pairs.
- Wallet and on-chain interaction: Multi-chain node operations, congestion and reorg handling, MPC/HSM, risk control and approval flows for deposits and withdrawals, closed-loop resolution of reconciliation errors.
- API and Market Data: WebSocket backpressure, partitioning and sharding, GSLB/nearest-location access, rate limiting and downsampling during market bursts.
- Security and Compliance: DDoS/WAF/bot governance, two-person review and audit of sensitive operations, meeting requirements such as SOC2/ISO 27001/PCI-DSS.
Metrics and Improvements
- Define and align reliability/KPI metrics to drive a closed loop of improvement: SLO attainment, MTTA/MTTR, change failure rate, incident recurrence rate, toil share, cost-to-transaction-volume ratio, etc.
What you'll need
- Over 8 years of experience in back-end/platform/operations engineering, over 4 years of SRE or production engineering experience, and over 2 years of team management/leadership experience.
- Proven track record of stability governance and incident handling in high-concurrency, low-latency businesses (trading/payments/advertising/large-scale real-time systems).
Skills
- SLO/SLI and error budget practices, observability system construction (Prometheus/Grafana/ELK or similar, OpenTelemetry, tracing).
- Kubernetes/Service Mesh, microservice gateway (Nginx/Envoy), CI/CD (GitHub Actions/GitLab CI, etc.), GitOps (Argo CD).
- Design and implementation of progressive delivery (canary/batched rollout/feature flags) and automatic rollback strategies.
- Data and Storage: MySQL sharding, replication, and failover; Redis/Kafka; backup and disaster recovery drills; a consistency and reconciliation mindset.
- Performance and Capacity Engineering: Stress testing, benchmarking, analysis and tuning (flame graphs/CPU/GC/network/TCP kernel parameters, etc.).
- Incident management: SEV classification, IM/IC command, cross-team collaboration and communication, writing high-quality postmortems and tracking action items.
Soft skills
- Strong sense of ownership and risk control mindset, data-driven, and good at balancing reliability, speed and cost.
- Outstanding cross-departmental communication and influence, able to drive strategy implementation and cultural change.
- Fluent in reading and writing both Chinese and English, able to handle overseas cloud and compliance communications (where overseas markets are involved).
Bonus points (preferred)
- Experience in exchange/matching/payment clearing and settlement/brokerage operations, or in operating crypto wallets and chain nodes.
- Experience in implementing anti-DDoS, WAF, bot management, rate limiting, and traffic governance systems.
- Experience in compliance systems (SOC2, ISO 27001, PCI-DSS, SOX-class controls), security audits and evidence retention.
- Experience in multi-region GSLB, cross-cloud/multi-cloud architecture, chaos engineering, and organizing GameDays.
- Go/Java optimization experience, practical experience in messaging systems (Kafka/RocketMQ/Pulsar) and storage (TiDB/Vitess/Citus/TDSQL, etc.).
- Experience in cost optimization and FinOps.