Chief Systems Engineer (Site Reliability)

Indeed

Full-time

Onsite

No experience limit

No degree limit

79Q22222+22

Description

Summary: Seeking a Chief Systems Engineer (Site Reliability) to lead technical strategies, ensure optimal monitoring, and guide incident resolution in production systems.

Highlights:

1. Lead responses to production incidents and guide resolution efforts.
2. Drive improvements in internal tools, scripts, and automation.
3. Maintain deep understanding of system architecture and dependencies.

We are looking for a **Chief Systems Engineer (Site Reliability)** to oversee and strengthen our production systems with a focus on reliability, incident resolution, and operational excellence. In this leadership role, you will guide technical strategies, perform root cause analysis, and ensure monitoring and observability are optimized across distributed systems. Join us and help shape robust, scalable, and dependable operations.

**Responsibilities**

* Lead responses to production incidents, prioritizing impact and guiding resolution efforts
* Direct troubleshooting across applications, services, and operating systems in complex environments
* Analyze logs, metrics, and alerts to perform accurate root cause determinations
* Ensure thorough documentation of incidents, solutions, and procedures in ServiceNow and other systems
* Coordinate closely with development teams, operations, support lines, and technical program managers
* Review and enhance SOPs, escalation frameworks, SLAs, and operational workflows
* Drive improvements in internal tools, scripts, and automation using Bash or Python
* Maintain deep understanding of system architecture, data flows, and dependencies to anticipate issues
* Identify and mitigate risks, recurring problems, and potential performance bottlenecks

**Requirements**

* Proven track record (7+ years) in systems engineering, site reliability, or advanced production support
* Expert troubleshooting and tracing skills in complex distributed systems environments
* Comprehensive knowledge of Linux-based systems; macOS familiarity advantageous
* Deep understanding of monitoring and observability, including working with logs, metrics, and alerts
* Extensive experience with incident management, escalation processes, and ServiceNow including Security Incident Response
* Strong capability in documenting technical processes and incident resolutions
* Background with ticketing systems such as ServiceNow or Broadcom Rally
* Hands-on experience in Bash and Python scripting
* Excellent communication skills for cross-team collaboration
* Commitment to mastering entire system ecosystems beyond immediate ticket resolution

**Nice to have**

* Advanced scripting or programming in Python, Bash, or similar languages
* Experience developing operational automation and internal tooling
* Familiarity with ITSM disciplines such as Incident, Problem, and Change Management
* Knowledge supporting distributed platforms or large-scale system infrastructures
* Exposure to specialized systems like image-based processing, camera/server pipelines, or ML-powered solutions

Source: Indeed