Principal Software Engineer, DGX Cloud Production Engineering
Location: Santa Clara, CA
Job Type: Full-time
Category: other-general
Posted: June 06, 2026
NVIDIA DGX Cloud is scaling GPU infrastructure across internal, partner, and cloud environments. We are looking for Principal Software Engineers to help shape the technical direction for production engineering, Kubernetes-based operations, automation, and reliability across large-scale GPU clusters.
This role is for senior technical leaders who can define architecture, lead through influence, build critical systems, and turn ambiguous infrastructure problems into durable software and operating models.
What you’ll be doing:
+ Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps, and Day 2 reliability needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments.
+ Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness.
+ Establish patterns for Kubernetes-based GPU cluster operations acro...