High-Performance AI Networking with Spectrum-X: Enabling NVIDIA’s Vision Through Zadara

Welcome to the second entry in our blog series exploring how Zadara empowers NVIDIA Cloud Providers (NCPs) to bring the NVIDIA Software Reference Architecture for Multi-Tenant Inference Clouds to life. In this post, we focus on Spectrum-X, NVIDIA’s high-performance Ethernet networking platform, and show how Zadara leverages Spectrum-X to deliver secure, per-tenant GPU networking that meets the demands of scalable, multi-tenant AI infrastructure.

The Role of High-Performance Networking in AI Clouds

Modern AI workloads, particularly those driving generative and reasoning-based applications, are highly distributed and communication-intensive. Whether training large language models across multiple GPU nodes or delivering real-time inference under strict latency constraints, networking performance becomes as critical as compute and memory.

NVIDIA Spectrum-X is a purpose-built, end-to-end Ethernet networking platform optimized to maximize the performance and efficiency of NVIDIA GPUs in modern cloud environments. It includes:

  • Spectrum-4 Ethernet Switches: Offering up to 64 ports of 800GbE in a compact 2U form factor, Spectrum-4 delivers an industry-leading total throughput of 51.2 terabits per second (Tb/s). Designed for use across smart-leaf, spine, and super-spine layers, these switches are foundational for building scalable, high-performance AI network fabrics.

  • BlueField-3 SuperNICs: These advanced network accelerators provide up to 400GbE RoCE connectivity between GPU servers, enabling NVIDIA GPUDirect® RoCE to maximize AI workload efficiency. With support for Direct Data Placement (DDP), in-order packet delivery, and enhanced telemetry, they help ensure consistent low-latency, high-throughput performance across distributed AI applications.

Spectrum-X leverages Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) to improve bandwidth efficiency and enforce workload isolation. It incorporates RoCE Adaptive Routing for congestion avoidance, telemetry-driven Congestion Control, and Performance Isolation to maintain consistent behavior across tenants. With the proper settings in place, Spectrum-X ensures scalable, high-performance AI networking.
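
To make the adaptive-routing idea concrete, here is a minimal conceptual sketch in Python. It is purely illustrative: the real decision happens per packet inside the Spectrum-4 ASIC, and the port names and queue depths below are invented for the example. The sketch simply shows the principle of steering each new flowlet toward the least-loaded uplink based on live telemetry such as egress queue depth.

```python
import random

# Hypothetical telemetry: egress queue depth per candidate uplink port.
# On a real Spectrum-4 switch this decision is made in hardware; this is
# only a software model of the selection logic.
uplink_queue_depth = {"swp1": 120, "swp2": 35, "swp3": 310, "swp4": 40}

def pick_uplink(queue_depth):
    """Choose the least-congested uplink, breaking ties randomly."""
    least = min(queue_depth.values())
    candidates = [port for port, depth in queue_depth.items() if depth == least]
    return random.choice(candidates)

# Each new flowlet (a burst of packets separated by an idle gap) can be steered
# independently, so one long RoCE flow spreads across uplinks while packets
# within a flowlet stay in order.
print(pick_uplink(uplink_queue_depth))  # -> "swp2" with the sample depths above
```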

Why Automatic Networking Management Is Key to AI Multi-Tenancy

In multi-tenant AI clouds, ensuring network isolation and maintaining consistent performance across tenants are critical. Spectrum-X delivers the foundational capabilities needed to support secure and efficient multi-tenant operations:

  • Traffic Isolation: Enforces strict separation between tenant traffic, preventing noisy neighbors and maximizing security in multi-tenant AI environments.

  • Quality of Service (QoS) and Fair Scheduling: Ensures each tenant receives consistent network performance, critical for maintaining service level agreements (SLAs).
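
As a rough illustration of the fair-scheduling idea, the following Python sketch models deficit-weighted round robin across per-tenant queues. It is a simplified software model only; on Spectrum-X this kind of scheduling is enforced in switch hardware, and the tenant names, packet sizes, and weights here are invented for the example.

```python
from collections import deque

# Hypothetical per-tenant packet queues (packet sizes in bytes) and SLA weights.
queues = {
    "tenant-a": deque([1500, 1500, 9000]),
    "tenant-b": deque([9000, 9000]),
    "tenant-c": deque([1500]),
}
weights = {"tenant-a": 2, "tenant-b": 1, "tenant-c": 1}  # relative bandwidth shares
QUANTUM = 3000  # bytes of credit granted per round, scaled by weight

def dwrr_round(queues, weights, deficits):
    """One deficit-weighted-round-robin pass; returns (tenant, packet) pairs sent."""
    sent = []
    for tenant, q in queues.items():
        deficits[tenant] += QUANTUM * weights[tenant]
        while q and q[0] <= deficits[tenant]:
            pkt = q.popleft()
            deficits[tenant] -= pkt
            sent.append((tenant, pkt))
        if not q:
            deficits[tenant] = 0  # idle tenants do not bank unused credit
    return sent

deficits = {tenant: 0 for tenant in queues}
for rnd in (1, 2, 3):
    print(f"round {rnd}:", dwrr_round(queues, weights, deficits))
```

With the weights above, tenant-a drains its queue roughly twice as fast as the equally weighted tenants, which is the kind of per-tenant bandwidth guarantee the SLA bullet refers to.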

However, unlocking these capabilities requires precise switch configurations, as detailed in the “Spectrum-X Compute Network Fabric Optimized for AI Cloud Deployment Guide Featuring NVIDIA HGX Systems.” In a multi-tenant cloud environment, these configurations must adapt in real time as GPU resources are allocated and reallocated across tenants. Without dynamic, automated adjustment of switch settings to match changing tenant topologies, the provisioning and scaling of resources become manual, error-prone, and counter to the on-demand experience users expect from a modern cloud.
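
To give a feel for what that automation looks like, here is a short Python sketch of the reconciliation loop an orchestration layer might run. Everything in it is hypothetical: the node names, the port map, and the helper names such as desired_port_state and apply_change are illustrative and do not represent Zadara's or NVIDIA's actual APIs.

```python
# Hypothetical example: keep switch port configuration in sync with tenant
# allocations. Names like desired_port_state and apply_change are illustrative.

CURRENT_ALLOCATION = {          # GPU node -> tenant (from the cloud control plane)
    "gpu-node-01": "tenant-a",
    "gpu-node-02": "tenant-a",
    "gpu-node-03": "tenant-b",
}
PORT_MAP = {                    # GPU node -> leaf switch port it is cabled to
    "gpu-node-01": "swp1",
    "gpu-node-02": "swp2",
    "gpu-node-03": "swp3",
}

def desired_port_state(allocation, port_map):
    """Each tenant gets its own VRF; every port lands in its tenant's VRF."""
    return {port_map[node]: f"vrf-{tenant}" for node, tenant in allocation.items()}

def reconcile(switch_state, allocation, port_map, apply_change):
    """Push only the ports whose VRF membership no longer matches the intent."""
    for port, vrf in desired_port_state(allocation, port_map).items():
        if switch_state.get(port) != vrf:
            apply_change(port, vrf)       # e.g. an API call to the switch fabric
            switch_state[port] = vrf

# Example: gpu-node-03 is reassigned from tenant-b to tenant-a.
switch_state = desired_port_state(CURRENT_ALLOCATION, PORT_MAP)
CURRENT_ALLOCATION["gpu-node-03"] = "tenant-a"
reconcile(switch_state, CURRENT_ALLOCATION, PORT_MAP,
          lambda port, vrf: print(f"move {port} -> {vrf}"))
```

The key design point is that the switch configuration is derived from the tenant allocation rather than edited by hand, so a node moving between tenants triggers exactly one targeted change on the fabric.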

Zadara’s Advantage: Extending Capabilities to Support Spectrum-X

Zadara’s platform architecture has long supported secure, multi-tenant cloud operations, and we’ve extended it with new capabilities to take full advantage of Spectrum-X’s performance and traffic-isolation features. By automating the provisioning and orchestration of GPU-to-GPU networking based on compute resource placement, Zadara simplifies Spectrum-X adoption for NVIDIA Cloud Providers (NCPs) building modern, multi-tenant AI clouds.

  1. Network-Aware Multi-Tenant Design: Zadara’s platform integrates software-defined networking (SDN) and tenant isolation. It automatically allocates VRFs and assigns switch ports to the correct VRF, ensuring compatibility with Spectrum-X’s granular control capabilities. This design supports secure, automated, and efficient multi-tenant operations across the GPU networking fabric.

  2. GPU-Net, Zadara’s Transparent, Policy-Based GPU Networking: Zadara automates the deployment and lifecycle management of GPU-Net over a backend switching fabric compatible with Spectrum-4 and aligned with NVIDIA’s reference architecture. GPU-Nets extend Zadara’s Virtual Private Cloud (VPC) model, providing dedicated east-west GPU communication paths between virtual machines. These paths are provisioned automatically, with no user configuration required. Only VMs within the same VPC can exchange GPU traffic, ensuring high-throughput, low-latency, and secure communication (see the sketch after this list). The switching fabric is dynamically programmed via APIs, allowing Zadara to reflect tenant topology changes in real time.

  3. Consistent Low Latency at Scale: Zadara’s orchestration intelligently aligns GPU-Net configuration and virtual machine placement with Spectrum-X’s rail-group topology, following NVIDIA’s best practices for deterministic performance. This ensures consistent, low-latency east-west GPU communication even as tenants scale up or down, preserving throughput and avoiding congestion.

  4. Flexible GPU Infrastructure: Zadara supports dynamic GPU resource allocation across tenants with zero work required from the NCP. NCPs can easily add or remove GPU nodes from the cloud, while Zadara’s control plane automatically handles the provisioning of GPU resources to tenants, the dedication of nodes and resources, and the enforcement of quotas. Most relevant to this post, GPU traffic is routed efficiently through the network while maintaining optimal resource utilization.
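
The sketch below (referenced from item 2) captures the GPU-Net scoping rule in a few lines of Python. It is a toy data model, not Zadara's API: names such as GpuNet, Vm, and gpu_traffic_allowed are invented for illustration, and the assumption shown is simply that each VPC maps to its own GPU-Net and VRF, with east-west GPU traffic permitted only between VMs in the same VPC.

```python
from dataclasses import dataclass

# Hypothetical data model for per-tenant GPU networking; names are illustrative.

@dataclass(frozen=True)
class GpuNet:
    vpc_id: str
    vrf: str          # VRF programmed on the leaf/spine fabric for this VPC

@dataclass(frozen=True)
class Vm:
    name: str
    vpc_id: str
    gpu_port: str     # fabric port its GPU/SuperNIC is attached to

def gpu_net_for(vpc_id):
    """One GPU-Net (and therefore one VRF) per VPC."""
    return GpuNet(vpc_id=vpc_id, vrf=f"vrf-gpu-{vpc_id}")

def gpu_traffic_allowed(a, b):
    """East-west GPU traffic is permitted only inside the same VPC/GPU-Net."""
    return a.vpc_id == b.vpc_id

vm1 = Vm("train-0", "vpc-tenant-a", "swp1")
vm2 = Vm("train-1", "vpc-tenant-a", "swp2")
vm3 = Vm("infer-0", "vpc-tenant-b", "swp3")

print(gpu_net_for(vm1.vpc_id))        # GpuNet(vpc_id='vpc-tenant-a', vrf='vrf-gpu-vpc-tenant-a')
print(gpu_traffic_allowed(vm1, vm2))  # True  -> same VPC, traffic flows
print(gpu_traffic_allowed(vm1, vm3))  # False -> different tenants, isolated
```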

 
NVIDIA Zadara Diagram

Conclusion

High-performance networking is foundational to scalable, secure, and predictable multi-tenant AI cloud infrastructure. NVIDIA Spectrum-X sets the standard for GPU data movement, but it introduces significant orchestration and management overhead in multi-tenant use cases. Zadara provides the software platform and orchestration layer that makes these capabilities practical for NCPs operating real-world multi-tenant environments.

In future posts, we’ll explore how Zadara extends support for additional NVIDIA technologies such as BlueField-3, including considerations around virtualization, isolation, and security for AI cloud infrastructure.

Simon Grinberg

Simon Grinberg is a technology and product leader focused on cloud infrastructure and virtualization. He took his first steps in the virtualization and cloud space at Qumranet, the company behind KVM, and continued as a product manager for Red Hat Enterprise Virtualization after its acquisition. He later joined Stratoscale, and went on to found Neokarm to pursue a new vision for cloud-native virtualization. Simon joined Zadara through its acquisition of Neokarm and is passionate about building scalable systems that solve real-world problems.
