




Summary: Seeking a Senior Cloud Engineer to own and operate an AWS platform, building standardized infrastructure, automation, observability, and scaling for HPC workloads.

Highlights:

1. Own and operate an AWS platform for HPC workloads at scale
2. Build standardized infrastructure, automation, and observability
3. Lead technical ownership and drive standards across teams

We are looking for a **Senior Cloud Engineer** to own and operate an AWS platform that enables an HPC team to run workloads reliably at scale. You will build standardized infrastructure, automation, observability, and scaling across multi-account AWS and Kubernetes. Apply to help deliver robust cloud foundations.

**Responsibilities**

* Own the AWS environment and platform operations that support HPC workloads at scale
* Provision and manage AWS accounts via internal self-service tooling and standardized patterns
* Build and maintain Terraform code to provision AWS resources and HPC-oriented clusters
* Design and operate centralized CI/CD pipelines to manage all accounts and clusters from a single repository
* Migrate remaining AWS accounts into the central repository and standardize infrastructure patterns
* Operate and support an in-cluster container registry (Harbor) and related platform components
* Implement and complete observability rollout across the AWS environment, including metrics, logs, dashboards, and alerting
* Support Kubernetes cluster operations and troubleshoot platform issues impacting HPC workloads
* Own and improve Cast AI as the primary mechanism for cluster scaling and optimization
* Design and support cross-cloud data transfer and networking solutions such as AWS DataSync and Interconnect between AWS and GCP
* Collaborate with the HPC team to translate requirements into implemented platform solutions
* Coordinate working hours to maintain at least 4 hours overlap with Houston time zone and occasional overlap with Australia

**Requirements**

* 3+ years of hands-on experience with Amazon Web Services in multi-account environments
* Infrastructure-as-code experience with Terraform (HCL/tofu), including modules and state
* Kubernetes operations experience, including troubleshooting clusters and workloads
* Proven ability to lead technical ownership as a staff-level individual contributor and drive standards across teams
* Strong project execution skills to take requirements, evaluate options, and deliver solutions with minimal guidance
* Advanced programming skills in Python for automation, tooling, and integrations
* Strong scripting skills in Bash for operational automation
* Solid CI/CD and GitOps workflow knowledge using tools such as GitLab CI or GitHub Actions
* Strong observability skills across metrics, logs, dashboards, and alerting using Prometheus and Grafana
* Experience with cluster scaling and cost optimization using Cast AI or similar tooling
* Ability to use AI-assisted tools for code generation, debugging, and documentation in daily work
* Upper-Intermediate English proficiency (CEFR B2)

**Nice to have**

* Google Cloud Platform experience, especially in cross-cloud integrations with AWS
* High-performance computing (HPC) experience with schedulers or data-intensive pipelines


