Love solving gnarly problems in AI infrastructure?Our client is building the AI Native GPU Cloud—and we need a senior HPC Site Reliability Engineer to keep it humming. You’ll own the reliability and performance of our cutting-edge Nvidia-based HPC systems. Think DGX clusters, RoCE topologies, and automation pipelines built in Ansible and Terraform. If that lights you up, read on. This role is remote (US-based) and offers the chance to shape our infrastructure from the ground up. Expect high-impact work, loads of autonomy, and collaboration with smart folks across architecture, engineering, and ops. You’ll:Set up and optimize HPC clusters and networks (think DGX, HGX, GPU Direct)Debug low-level networking issues with Cisco, Juniper, and moreAutomate configs with Ansible TerraformMonitor everything with Grafana, UFM, ELK, NetQOwn 24/7 reliability, on-call, and root cause analysisThis role is perfect if you:Have 6 years in HPC or networking-heavy rolesKnow BGP, EVPN, VxLAN, RDMA inside and outHave SRE experience in high-stakes environmentsLove solving infra puzzles at scaleBonus points for CCIE/JNCIS, InfiniBand, or cloud/HPC interconnect experience. Sound like your kind of challenge? Hit apply and let’s talk.
Francis Alexander