




You will own reliability for core services across multiple clouds, drive automation, and mentor more junior engineers. You will partner with developer teams to embed resilience into feature delivery.

Responsibilities

* Define and maintain SLIs/SLOs; monitor alignment and error budget usage
* Lead incident response and postmortems; implement corrective measures
* Automate operations tasks via tooling (e.g. auto-remediation, scaling rules)
* Build, improve, and maintain CI/CD pipelines, canary deployments, and blue/green strategies
* Lead technical discussions with customers to align on reliability, scalability, and performance requirements
* Drive continuous platform improvements across the service lifecycle, including architecture, monitoring, and operational processes
* Implement and extend observability systems (metrics, tracing, log aggregation)
* Optimize performance and cost by tuning cloud services, autoscaling, and resource rightsizing
* Design, deploy, and operate containerized workloads using Docker and Kubernetes in production environments
* Collaborate with dev teams to integrate resilience patterns (circuit breakers, bulkheading)
* Participate in architecture discussions around high availability and disaster recovery
* Mentor mid-level and junior SREs; conduct reliability design reviews

Must-have Qualifications

* 5–8 years of experience in a reliability or operations role
* Cloud-agnostic certification: Terraform Associate, Certified Kubernetes Administrator (CKA), or SRE Foundation
* Cloud provider certification: Professional-level certification in AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional)
* Solid coding skills (Python, Go, or equivalent)
* Experience with IaC, CI/CD pipelines, and monitoring/observability stacks (Prometheus, Grafana, OpenTelemetry, ELK, Jaeger)
* Experience working with distributed systems and production-scale services

Nice-to-have Skills

* Exposure to multi-cloud data replication or cross-cloud networks
* Experience with chaos engineering or fault injection


