Site Reliability Engineer, AI/ML Infrastructure
Location
Toronto, ON
Job Type
Full-time
Category
Other-General
Posted
May 23, 2026
Overview
We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter, packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers.
You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing, deploying, and keeping everything running smoothly. That means troubleshooting issues as they arise, monitoring performance, developing automation to make our lives easier, and working closely with engineering and science teams to ensure they have what they need. You'll also help us plan for future capacity and evaluate new technologies as we continue to scale.
Responsibilities
Manage and optimize HPC cluster operations
Deploy and maintain infrastructure-as-code solutions
Support ML/research teams with cluster usage optimization
Operate, troubleshoot and optimize Ceph storage clusters
<...