Senior Site Reliability Engineer

Indeed

Full-time

Onsite

No experience limit

No degree limit

67M7F82C+QM

Description

Summary: Seeking an experienced SRE to define and maintain reliability targets, automate operations, build CI/CD pipelines, optimize cloud services, and mentor team members.

Highlights:
1. Lead incident response and drive continuous platform improvements
2. Design, deploy, and operate containerized workloads in production
3. Mentor mid-level and junior SREs and conduct reliability design reviews

* Define and maintain SLIs/SLOs; monitor adherence to them and error budget usage
* Lead incident response and postmortems; implement corrective measures
* Automate operations tasks via tooling (e.g. auto-remediation, scaling rules)
* Build, improve, and maintain CI/CD pipelines, including canary deployments and blue/green strategies
* Lead technical discussions with customers to align on reliability, scalability, and performance requirements
* Drive continuous platform improvements across the service lifecycle, including architecture, monitoring, and operational processes
* Implement and extend observability systems (metrics, tracing, log aggregation)
* Optimize performance and cost by tuning cloud services, autoscaling, and resource rightsizing
* Design, deploy, and operate containerized workloads using Docker and Kubernetes in production environments
* Collaborate with dev teams to integrate resilience patterns (circuit breakers, bulkheads)
* Participate in architecture discussions around high availability and disaster recovery
* Mentor mid-level and junior SREs; conduct reliability design reviews

* 5–8 years of experience in a reliability or operations role
* Cloud-agnostic certification: Terraform Associate, Certified Kubernetes Administrator (CKA), or SRE Foundation
* Cloud provider certification: Professional-level certification in AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional)
* Solid coding skills (Python, Go, or equivalent)
* Experience with IaC and CI/CD pipelines
* Comfortable with monitoring/observability stacks (Prometheus, Grafana, OpenTelemetry, ELK, Jaeger)
* Experience working with distributed systems and production-scale services

Nice-to-have Skills:
* Exposure to multi-cloud data replication or cross-cloud networks
* Experience with chaos engineering or fault injection

Source: Indeed