Our Company:
Hydra Host is a startup focused on building the first true marketplace for compute. We believe people should have the power to make their own decisions about which compute they use without sacrificing trust for ease of use. We offer an alternative to the cloud hyperscaler model by directly matching independent data centers and compute providers to end users. Furthermore, we have much higher diversity in providers and hardware configurations than anyone on the market. That makes it a tougher challenge, but a bigger learning and innovation opportunity. For this we are seeking a HPC Solutions Engineer to join our team and own this arena.
Our Philosophy:
Hydra Host is a people-centric, mission-driven, remote-friendly, experimental company. Results > titles. We believe in individual freedom, self-expression, and preserving an online culture of heterodoxy that has driven massive human flourishing.
Job Summary:
As a High Performance Compute Solutions Engineer on our team you will report directly to the Co-Founder & CTO and work collaboratively with other team members. You will be responsible for bespoke scoping and execution of high end compute resource orchestration for premium clients with the mission of configuring and maintaining large scale GPU clusters to best serve them. This includes installing, configuring, and monitoring dependencies required by customers and working with a wide variety of teams and workloads to ensure they can effectively leverage their large scale clusters.
Key Responsibilities:
- Work with customers in technical discovery to help define requirements and deliverables for their use cases and help them to effectively utilize distributed GPU computing resources.
- Identify and recommend the best tools for each customer, while building boilerplate and reference implementations that can be reused for subsequent customers.
- Manage GPU clusters and coordinate with IT to ensure efficient operation of NVIDIA GPUs, utilizing technologies such as InfiniBand for high-speed networking.
- Oversee the deployment and maintenance of machine learning environments using virtual storage solutions and distributed computing (HPC) tools such as SLURM.
- Automate infrastructure provisioning and management using tools such as Ansible and Terraform.
- Collaborate with data scientists and engineers to ensure seamless integration of ML models into production environments.
- Conduct performance tuning and optimization of systems to maximize throughput and reduce latency.
- Stay current with the latest industry trends in machine learning technologies and HPC to ensure the use of best practices in infrastructure setup and model development.
- Document and maintain operational procedures and system configurations.
Required Qualifications:
- 7+ years of high performance compute, distributed machine learning, GPU computing, and/or system architecture experience.
- Proficient in managing NVIDIA GPU environments and a familiarity with GPU computing frameworks and libraries.
- Strong experience with high-speed networking technologies, specifically InfiniBand.
- Experience with HPC job schedulers, preferably SLURM.
- Expertise in automating environment setup and maintenance using Ansible and Terraform.
- Demonstrated ability in deploying and managing virtual storage solutions.
- Strong coding skills in Python and familiarity with machine learning libraries and frameworks.
- Excellent problem-solving, communication, and teamwork skills.
Personal Skills:
- A strong communicator who is empathetic, personable and agreeable
- Conscientious, meticulous, organized, and self-motivated, with the ability to define goals and prioritize workloads.
- An effective team player with the ability to take ownership with a results-oriented growth mindset
Benefits:
- You will work with the most diverse hardware configurations and locations available to anyone in the industry as well as cutting edge GPU use cases
- This role is fully remote with a high accountability and high agency culture
- This role offers a competitive salary, equity and benefits
- This role offers flexible PTO