Middle DevOps Engineer

Indeed

Full-time

Onsite

No experience limit

No degree limit

79Q22222+22

Description

Summary: Seeking a Middle DevOps Engineer to manage Kubernetes and Linux platforms, focusing on Volcano scheduling and automated infrastructure operations for stable, efficient AI compute environments.

Highlights:

1. Manage Kubernetes administration and GPU clusters for AI compute environments.
2. Automate with Python and UNIX shell scripting for client-facing delivery.
3. Operate and optimize GPU-enabled Kubernetes clusters and Linux environments.

We are enabling dependable GPU compute by operating Kubernetes and Linux platforms focused on Volcano scheduling and automated infrastructure operations. As a Middle DevOps Engineer, you will manage Kubernetes administration, run GPU clusters on Kubernetes and Linux nodes, and create automation with Python and UNIX shell scripting for a client-facing delivery team. Apply to help deliver stable, efficient AI compute environments at scale.

**Responsibilities**

* Provision, configure, and operate GPU-enabled Kubernetes clusters and standalone Linux compute environments to keep scheduling and performance optimized
* Set up and administer Volcano job scheduling, including queue setup, POD execution, GPU allocation, and namespace quota enforcement
* Own Kubernetes administration across namespaces, RBAC, resource quotas, and workload isolation approaches
* Automate job submission, resource provisioning, and system reporting by creating and maintaining Python and Shell scripts
* Coordinate with orchestration, optimization, and observability teams to raise scheduling efficiency, improve capacity utilization, and streamline researcher workflows
* Observe infrastructure health and resource utilization, supplying data and feedback for optimization and reporting needs
* Improve infrastructure, tooling, and automation workflows to increase performance, scalability, and usability
* Maintain operational processes that provide a smooth and efficient experience for researchers running diverse AI and computational workloads

**Requirements**

* Hands-on background with 2+ years of experience in DevOps or infrastructure engineering within complex, large-scale environments
* Expertise in Kubernetes administration and orchestration, including namespaces, POD scheduling/distribution, PVC, NFS, and resource quota management
* Practical experience with the Volcano scheduler for GPU job execution, queue configuration, and workload prioritization integrated with Kubernetes
* Proven ability to operate GPU cluster environments in Kubernetes as well as on standalone Linux compute nodes
* Advanced Python scripting skills for infrastructure automation, plus proficiency in UNIX Shell scripting such as Bash
* Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management
* Solid understanding of infrastructure automation and orchestration concepts and related tooling
* Fluent English communication skills (spoken and written) for direct client interaction

**Nice to have**

* Knowledge of Helm package management for Kubernetes applications
* Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki
* Skills in Infrastructure as Code tools such as Terraform
* Background in multi-cloud Kubernetes environments including Amazon EKS and Google GKE
* Understanding of Azure Networking including VPN, ExpressRoute, and network security
* Familiarity with AI-assisted coding tools such as GitHub Copilot, ChatGPT, and Claude
* Experience with hybrid (cloud and on-premises) scheduling and resource optimization

Source: Indeed