About Hydra Host
Hydra Host is a Founders Fund–backed NVIDIA cloud partner building the infrastructure platform that powers AI at scale. We connect AI Factories — high-performance GPU data centers — with the teams that depend on them: research labs training foundation models, enterprises running production inference, and developer platforms demanding scalable compute capacity.
We operate where hardware meets software — the bare metal layer where reliability, performance, and speed matter most. As AI workloads evolve faster than traditional cloud infrastructure can adapt, Hydra is building the foundation layer that makes it all possible.
The Role
As an AI Solutions Engineer, you’ll ensure our AI Platform and Enterprise customers have an exceptional technical experience from first deployment to scale. You’ll work at the intersection of customer enablement, infrastructure engineering, and AI performance optimization — helping teams build reliable, high-performance AI platforms on top of Hydra.
This role is about building the future before our customers ask for it. You’ll prototype and validate proof-of-concept neo-clouds on top of Hydra — standing up real AI platforms, running real workloads, and uncovering sharp edges before our AI Platform customers ever see them. These POCs exist to demonstrate customer value & pressure-test Hydra’s infrastructure, APIs, SDKs, and workflows.
What you learn becomes product improvements, best practices, reference architectures, SDKs, templates, and default configurations across Hydra.
You’ll also serve as a technical face of Hydra to the developer ecosystem — showing what’s possible, how to do it, and why it matters.
Qualities
- You learn faster than anyone else around you
- You are constantly picking up new technologies & trying them out
- You are a great communicator and can translate technical concepts clearly for customers
- You know how to stand up and fully configure AI stacks on bare-metal Linux better than anyone
- You can move up and down the stack fluidly
- You are comfortable taking on large challenges & finding your way through the jungle to find the best solutions
- You are ahead of the pack and can stay ahead of our AI Platform customers
What You’ll Do
- Prototype and operate proof-of-concept AI platforms and neo-clouds on top of Hydra using the Brokkr API to validate the developer experience.
- Build and maintain an open-source “neo-cloud in a box” reference implementation that demonstrates multi-tenancy, spins servers up and down based on demand, and exposes containerized or virtualized GPU access.
- Dogfood Hydra’s APIs, infrastructure, and tooling to continuously find gaps, sharp edges, and failure modes before customers do, working with product and engineering to resolve them.
- Work closely with the API and monetization teams by incorporating direct customer feedback into feature prioritization, pricing models, and API design.
- Run and validate the latest AI platforms, inference stacks, and orchestration frameworks on Hydra to ensure first-class support.
- Collaborate closely with product and engineering to turn learnings into productized workflows, defaults, and automations.
- Create targeted provisioning templates (e.g., self-managed Kubernetes, specialized inference engines, custom OS images) by researching common software stacks, licenses, and dependencies used by AI platforms
- Provide developers with high-quality technical enablement: code samples, SDK contributions, reference implementations, and clear documentation.
- Act as a technical voice for Hydra’s developer ecosystem: host webinars, write technical content, run demos, participate in events, and support hackathons showcasing what’s possible on Hydra.
- Document best practices and standardize configurations to scale customer success globally.
What We’re Looking For
Required Skills
- NVIDIA GPU Stack — Deep knowledge of the NVIDIA hardware and software stack (drivers, firmware, NVLink, NCCL, CUDA libraries), and how stack compatibility impacts performance.
- Bare Metal Linux — Strong experience administering bare-metal Linux systems, including driver stacks and kernel configuration.
- AI Workloads — Proficiency running Hugging Face and PyTorch workloads, model deployment frameworks such as vLLM, and large-scale inference/training.
- AI Benchmarking — Hands-on experience benchmarking AI workloads with frameworks such as Megatron.
- Workload Orchestration — Experience running Kubernetes clusters (Cluster API), Slurm, and Ansible for cluster automation and workload management.
- Scripting — Solid scripting skills (e.g., shell, Python, Perl, Ruby)
- Networking — OSI Layer 2/Layer 3 fundamentals, TCP/IP, DNS, VLANs, and bonding
- East/West Fabric — Familiarity with RoCE or InfiniBand.
- Observability and Monitoring — Profiling with nvidia-smi; Prometheus/Grafana or the ELK stack
- Container Runtimes — Docker, Podman, Singularity
- Cloud Provisioning — Terraform, cloud-init, etc.
Nice to Have
- HPC Clusters — Experience in HPC or large distributed training environments.
- TEE — Familiarity with Trusted Execution Environments (e.g., Intel TDX) and Confidential Compute.
- Storage Systems — Familiarity with local and distributed storage: NVMe, RAID, NFS, and distributed file systems such as Ceph, WEKA, VAST, or DDN.
- BMC Provisioning — MAAS, iPXE, IPMI
Developer & Ecosystem Impact
This role helps define how developers experience Hydra — through examples, reference architectures, code, and real deployments, not marketing abstractions.
You’ll influence how AI platforms evaluate, adopt, and build on Hydra by making the platform tangible, understandable, and compelling to engineers.
What We Value
- Customer Obsession – You genuinely care about solving customers’ problems, communicate proactively, and never leave users in the dark.
- Principled Thinking – You make decisions grounded in long-term value and clear principles, not shortcuts or convenience.
- Technical Curiosity – You love learning how things work, across domains and disciplines.
- Systems Thinking – You don’t just fix issues — you identify root causes and design solutions that prevent them from recurring.
Why Join Hydra Host
- Equity ownership — Meaningful stake in what we’re building together.
- Competitive salary — We pay fairly and transparently.
- Healthcare coverage — Medical, dental, vision for you and your family.
- Fully remote team — Remote-first with hubs in Phoenix, Boulder, and Miami, plus periodic team offsites.
- Direct impact — Your work will shape how thousands of GPU clusters are deployed and operated across the AI ecosystem.