Our client is building the AI Native GPU Cloud—and we need a senior HPC Site Reliability Engineer to keep it humming.
You’ll own the reliability and performance of our cutting-edge NVIDIA-based HPC systems. Think DGX clusters, RoCE topologies, and automation pipelines built in Ansible and Terraform. If that lights you up, read on.
This role is remote (US-based) and offers the chance to shape our infrastructure from the ground up. Expect high-impact work, loads of autonomy, and collaboration with smart folks across architecture, engineering, and ops.
You’ll:
- Set up and optimize HPC clusters and networks (think DGX, HGX, GPUDirect)
- Debug low-level networking issues with Cisco, Juniper, and more
- Automate configs with Ansible and Terraform
- Monitor everything with Grafana, UFM, ELK, and NetQ
- Own 24/7 reliability, on-call, and root cause analysis
You:
- Have 6 years of experience in HPC or networking-heavy roles
- Know BGP, EVPN, VXLAN, and RDMA inside and out
- Have SRE experience in high-stakes environments
- Love solving infra puzzles at scale
Sound like your kind of challenge? Hit apply and let’s talk.