🚀 Site Reliability Engineer – Future-Proof Your Career in HPC 🚀

Location: Remote (Global team, global scale)
Type: Full-time
Visa Sponsorship: No

Do you get a buzz out of making complex systems hum at scale? Do you secretly enjoy shaving milliseconds off performance bottlenecks? Are you the type who believes “if it’s not automated, it’s broken”? If so, read on…
We’re on the hunt for a Site Reliability Engineer to join a high-growth team running mission-critical compute environments across the globe. This isn’t your average SRE gig: you’ll be working at the bleeding edge of infrastructure, where uptime, throughput, and ultra-low latency aren’t just buzzwords but survival tactics.
Why You’ll Love This Role
  • 🌍 Impact at scale – your work underpins globally distributed, performance-intensive environments that run some of the most demanding workloads on earth.
  • 🔧 Hands-on depth – from tuning kernels and optimizing I/O to running Kubernetes clusters and automating everything in sight.
  • 🧠 Upskill into HPC – get cross-trained in High Performance Computing, supported by world-class partners (yes, that includes the big names like NVIDIA 👀).
  • 🎓 Mentorship & growth – you’ll both learn and lead, helping junior engineers level up while expanding your own mastery in HPC and GPU-powered infrastructure.
What You’ll Be Doing
  • Designing, running, and evolving resilient, scalable infrastructure.
  • Optimizing Linux systems down to the BIOS, kernel, and hardware subsystems.
  • Automating infrastructure lifecycles and simplifying troubleshooting through code.
  • Keeping systems observable with Prometheus, Grafana, and custom monitoring wizardry.
  • Working shoulder-to-shoulder with HPC engineers to bring next-gen performance into production.
  • Playing a key role in 24/7 operations, backed by a team that’s big on collaboration and smart rotations.
What You Bring
  • Deep Linux wizardry (Ubuntu is your second home).
  • Strong scripting/automation chops (Python, Bash, Ansible).
  • Experience running orchestration and provisioning platforms (Kubernetes, MAAS, etc.).
  • Networking fluency: TCP/IP, DNS, VLANs, routing, switching.
  • A knack for explaining complex technical decisions to both engineers and non-techies.
  • Bonus points if you’ve dabbled in HPC, GPU infra, or InfiniBand (but don’t worry if not — we’ll help you get there).
Perks & Culture
  • Remote-first and flexible, because great engineers live everywhere.
  • A team that values do → document → automate.
  • An environment that celebrates diversity, curiosity, and real talk.
  • The chance to shape the future of infrastructure where AI and HPC collide.