

GPU Systems Engineer
Hudson River Trading (HRT) is seeking GPU Systems Engineers to enhance its HPC/AI research environment. The role involves collaborating with experts to manage large-scale infrastructure, including GPU clusters and petabyte-scale storage, and keeping trading and research systems running 24/7.
Qualifications
- 5+ years of experience in large-scale Linux systems engineering in HPC, AI or distributed infrastructure roles
- Extensive experience in Linux system installation, performance tuning, and troubleshooting
- Expertise in troubleshooting distributed GPU workloads
- Deep knowledge of GPU optimization and performance
- Proficiency in Python scripting and automation frameworks
- Experience with NVIDIA technologies beyond CUDA, such as NCCL, GPUDirect RDMA, and NVLink
- Familiarity with configuration management tools (e.g. Salt, Ansible, Puppet, Chef)
- Comfortable diagnosing complex system issues at the hardware, OS, and network levels
- Strong communication and organizational skills; able to collaborate across diverse technical teams
- Thrive in fast-paced environments and are excited by high-impact work
Responsibilities
- Design, build, and optimize large-scale distributed GPU compute clusters
- Identify and resolve performance bottlenecks in GPU workloads across compute, storage, and networking layers
- Collaborate with research and development teams to profile, benchmark, and fine-tune GPU-based workloads
- Automate system deployment, monitoring, and troubleshooting across thousands of nodes
- Collaborate with research and engineering teams to support evolving workloads
- Own critical infrastructure projects, from concept to implementation and support
- Test and deploy new hardware and software, and partner with vendors to resolve complex issues



