Love solving gnarly problems in AI infrastructure?
Our client is building the AI Native GPU Cloud—and we need a senior HPC Site Reliability Engineer to keep it humming.
You’ll own the reliability and performance of our cutting-edge NVIDIA-based HPC systems. Think DGX clusters, RoCE topologies, and automation pipelines built in Ansible and Terraform. If that lights you up, read on.

This role is remote (US-based) and offers the chance to shape our infrastructure from the ground up. Expect high-impact work, loads of autonomy, and collaboration with smart folks across architecture, engineering, and ops.
You’ll:

Set up and optimize HPC clusters and networks (think DGX, HGX, GPUDirect)
Debug low-level networking issues with Cisco, Juniper, and more
Automate configs with Ansible and Terraform
Monitor everything with Grafana, UFM, ELK, NetQ
Own 24/7 reliability, on-call, and root cause analysis

This role is perfect if you:

Have 6 years of experience in HPC or networking-heavy roles
Know BGP, EVPN, VXLAN, and RDMA inside and out
Have SRE experience in high-stakes environments
Love solving infra puzzles at scale

Bonus points for CCIE/JNCIS, InfiniBand, or cloud/HPC interconnect experience.
Sound like your kind of challenge? Hit apply and let’s talk.
