How Zadara Optimizes AI Performance for a Well-Sized Infrastructure, and Why Zadara Is the Ideal AI Factory Platform for Both Training and Inference
AI infrastructure projects rarely fail because of the wrong technology choices; far more often, they fail because of the wrong sizing choices. Undersize, and your LLM grinds to a halt under real user load. Oversize, and you have committed to expensive GPU hardware that sits idle 80% of the time.
This article walks you through a practical, step-by-step framework for sizing GPU and compute infrastructure for AI workloads, from initial requirements gathering through to a deployable architecture recommendation. It also explains how using Zadara as the foundation optimizes overall AI performance once the proper sizing decisions are made, and how Zadara's flexibility supports inference, sovereign AI, and large-scale batch training with superior performance on a well-sized deployment.
Whether you are planning a private AI deployment for data-sovereign workloads, a GPU-accelerated analytics pipeline, or a multi-tenant inference platform, the methodology is the same. The hardware choices Zadara offers or otherwise supports, from focused inference nodes built on NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs (96 GB VRAM per GPU, 4 GPUs per node) up to large-scale HGX clusters with NVIDIA Spectrum-X networking or InfiniBand, cover the full spectrum of use cases discussed here.
Zadara's position in the AI space and its AI-optimized solution were highlighted at NVIDIA GTC 2026, where Zadara announced alignment with NVIDIA's Software Reference Guide for multi-tenant AI cloud infrastructure and presented its Neutral AI Factory architecture. NVIDIA's GTC blog also featured Zadara in the context of a strategic partnership with DDN to deliver high-performance AI infrastructure for sovereign clouds and multi-tenant AI factories, built on NVIDIA reference architectures.
Why Sizing Matters More Than Hardware Choice
The NVIDIA GPU lineup is impressive and continues to evolve rapidly, from workload-focused professional GPUs to the latest datacenter accelerators like the NVIDIA H200 GPU and NVIDIA B300 GPU. Larger cluster configurations with NVIDIA Spectrum-X Ethernet or InfiniBand networking can scale to thousands of GPUs. But before you decide which GPU to use, you need to answer a more fundamental question: What type of workload are you running?
AI workloads on GPUs fall into two fundamentally different categories, and they have different optimization targets:
| | Batch / Training | Online Inference |
| --- | --- | --- |
| Goal | Maximum throughput | Minimum latency & time to first token |
| GPU utilization | Up to 90-100% | Up to 30-60% |
| User-facing? | No | Yes |
| Time-sensitive? | No | Yes |
| Example | Fine-tuning, embeddings, nightly analytics | Chatbots, internal copilots, API services |
Getting this distinction right is the first step in any sizing conversation. More is not always better, or more cost-effective: in our experience, many customers come in requesting the fastest GPU available when their actual workload is batch-oriented, and a right-sized GPU configuration operated by Zadara and matched to the workload type will be more cost-effective and operationally simpler.
The Three Pillars of GPU Sizing
Once you know your workload type, there are three technical dimensions that determine your hardware requirements. Think of them as layers that stack on top of each other.
Pillar 1: Model Weights, Your VRAM Baseline
Every AI model has parameters, and each parameter requires memory to load onto the GPU. The numerical precision (format) you choose determines how many bytes each parameter occupies, trading model quality against memory footprint, so the amount of required memory depends directly on that choice:
VRAM for model weights = parameter count x bytes per parameter

| Precision Format | Bytes per Parameter | LLaMA-3 70B Example | LLaMA-3 70B on 4x NVIDIA RTX PRO 6000 (384 GB) |
| --- | --- | --- | --- |
| FP32 (full precision) | 4 bytes | 280 GB | Does not fit on a single node |
| FP16 / BF16 (standard) | 2 bytes | 140 GB | Fits with KV-cache headroom |
| INT8 (quantized) | 1 byte | 70 GB | Fits on a single GPU (96 GB) |
| INT4 (aggressive quantization) | 0.5 bytes | 35 GB | Fits on a single GPU with ample room |
Table 1 – Precision formats and the overall size of the AI model.
A practical rule of thumb:

model size in billions x 2 = minimum VRAM in GB at FP16

Zadara's GPU Cloud supports a range of NVIDIA GPU configurations across its hundreds of edge clouds globally. Inference-optimized nodes use the NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7 VRAM per GPU, 4 GPUs per node for a total of 384 GB), suited for focused workloads and sovereign edge deployments. For larger models and high-concurrency requirements, Zadara supports HGX-class AI Factory datacenter accelerators, including the NVIDIA H200 GPU (141 GB HBM3e, 8 GPUs per node for 1,128 GB total) and the NVIDIA B300 GPU (288 GB HBM3e, 8 GPUs per node for 2,304 GB total). For models that exceed single-node VRAM, multi-node configurations with NVIDIA Spectrum-X or InfiniBand clusters extend this ceiling to any scale required.
Sizing note: For production workloads in regulated industries (finance, healthcare, government) we recommend FP16 or INT8 rather than INT4. The quality difference is measurable, and for sovereign AI deployments, reliability matters more than squeezing the last gigabyte out of a GPU configuration.
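To make the weights math concrete, here is a minimal Python sketch of the Table 1 calculation and the rule of thumb above. The helper name is illustrative, not a Zadara tool:

```python
# Bytes per parameter for the common precision formats (Table 1).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """VRAM for model weights = parameter count x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

# LLaMA-3 70B at each precision, reproducing Table 1:
for fmt in ("fp32", "fp16", "int8", "int4"):
    print(f"{fmt:>5}: {weights_vram_gb(70, fmt):6.1f} GB")
# fp32: 280.0 GB | fp16: 140.0 GB | int8: 70.0 GB | int4: 35.0 GB
```

Note that the FP16 default reproduces the "billions x 2" rule of thumb directly.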
Pillar 2: Memory Bandwidth, The Real Performance Driver
Here is a counter-intuitive truth about LLM inference: the bottleneck rarely lies in compute (FLOPS). More often than not, it is memory bandwidth, the speed at which model weights can be read from VRAM during token generation, that determines performance.
During the decode phase (generating each output token), the model must read its full set of weights for every token produced. This is a memory-bandwidth-bound operation, not a compute-bound one.
Approximate tokens/second = GPU Memory Bandwidth (GB/s) / Model Size (GB)

For a 70B FP16 model (140 GB) on a single NVIDIA H200 SXM GPU with 4,800 GB/s bandwidth:

4,800 GB/s / 140 GB = ~34 tokens/s per user (single GPU, no batching)

With tensor parallelism across four NVIDIA H200 GPUs (combined ~19,200 GB/s) and optimized serving via vLLM:
19,200 GB/s / 140 GB = ~137 tokens/s aggregate throughput before batching overhead

On inference-optimized nodes with NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs (each providing approximately 1,792 GB/s GDDR7 bandwidth and 96 GB VRAM), a 70B INT8 model (70 GB) fits on a single GPU. The single-card estimate is therefore 1,792 / 70, or approximately 25 tokens per second. With all four GPUs in tensor parallelism across the node, aggregate bandwidth reaches approximately 7,168 GB/s, delivering substantially higher throughput with batching.
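These back-of-envelope figures are easy to script. A minimal sketch, using the bandwidth numbers above; it deliberately ignores KV-cache reads, kernel overhead, and batching effects:

```python
def single_stream_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Decode is bandwidth-bound: each token requires one full pass over the weights."""
    return bandwidth_gb_s / model_size_gb

print(single_stream_tokens_per_s(4_800, 140))      # ~34  -> one H200, 70B FP16
print(single_stream_tokens_per_s(4 * 4_800, 140))  # ~137 -> tp=4 across four H200s
print(single_stream_tokens_per_s(1_792, 70))       # ~25  -> one RTX PRO 6000, 70B INT8
```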
This is why multi-GPU configurations using Zadara’s AI Cloud, especially those utilizing NVIDIA Spectrum-X high-bandwidth networking or InfiniBand between nodes, unlock disproportionate performance gains for large model inference.
Zadara’s Spectrum-X implementation is built on NVIDIA Spectrum-4 Ethernet switches delivering up to 51.2 Tb/s of aggregate switching capacity, paired with BlueField-3 SuperNICs providing up to 400 GbE RoCE connectivity between GPU servers.
What makes this operationally unique is Zadara's GPU-Net technology: an orchestration layer that automatically provisions dedicated east-west scale-out GPU communication paths between virtual machines, aligns configurations with Spectrum-X's rail-optimized topology following NVIDIA's reference architecture, and handles multi-tenant isolation through automated VRF allocation. Customers get best-in-class GPU interconnect performance on Ethernet, without the operational complexity of managing it manually.
Pillar 3: KV-Cache, The Hidden VRAM Consumer
When a language model generates text, it produces one token at a time while attending to the full context of tokens seen so far, a context that grows as inference progresses. To avoid recomputing attention over that entire context at every step, the model internally computes and stores Key and Value vectors for every token it has seen. This is the KV-Cache, and without it, the model would have to reprocess the whole context on every generation step, making inference impractically slow.
The KV-Cache is essential. It is also invisible if your sizing exercise covers only model weights, and it can easily double or triple your total VRAM requirement under real workloads.
The cache grows with three factors: model architecture (number of layers, attention heads, hidden dimension), context length, and concurrent users. Modern large language models like LLaMA-3 70B use Grouped Query Attention (GQA) with 8 KV heads instead of the full 64, which significantly reduces per-user KV-cache size compared to older multi-head attention architectures. A practical estimate for LLaMA-3 70B with GQA at 4K context is approximately 0.5 to 1 GB of KV-Cache per concurrent user.
| Concurrent Users | KV-Cache (70B GQA, 4K ctx) | Model Weights (INT8) | Total VRAM |
| --- | --- | --- | --- |
| 5 | ~4 GB | 70 GB | ~74 GB |
| 20 | ~16 GB | 70 GB | ~86 GB |
| 50 | ~40 GB | 70 GB | ~110 GB |
| 100 | ~80 GB | 70 GB | ~150 GB |
Table 2 – KV-cache growth with concurrent users (LLaMA-3 70B with GQA, 4K context, INT8 weights).
Note: For older architectures without GQA (such as the original GPT-style full multi-head attention), KV-cache per user can be 3 to 4 times higher. Always check the specific model’s attention configuration when sizing.
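The per-user figures above follow directly from the attention geometry. Here is a hedged estimator using LLaMA-3 70B's published architecture (80 layers, 8 KV heads, head dimension 128); actual serving-time footprints vary with the KV-cache precision your engine uses:

```python
def kv_cache_gb_per_user(layers: int, kv_heads: int, head_dim: int,
                         context_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """Per-user KV-cache: 2 (K and V) x layers x kv_heads x head_dim x context x bytes."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

# LLaMA-3 70B with GQA at 4K context:
print(kv_cache_gb_per_user(80, 8, 128, 4096, 2.0))  # ~1.3 GB with an FP16 cache
print(kv_cache_gb_per_user(80, 8, 128, 4096, 1.0))  # ~0.67 GB with an 8-bit cache
# The 0.5-1 GB rule of thumb above corresponds to the quantized end of this range.
```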
This is where vLLM with PagedAttention becomes a critical architectural choice rather than an optional optimization. By managing the KV-Cache as dynamic memory pages, much as virtual memory works in an operating system, vLLM reduces VRAM waste from pre-allocated but unused cache blocks. In practice, this multiplies the number of concurrent users a given hardware configuration can serve by 3 to 4 times.
On Zadara's zCompute Cloud, we deploy vLLM as a containerized workload on Kubernetes, using the NVIDIA GPU Operator for device plugin management. This gives you the scheduling flexibility to right-size at runtime, not just at procurement time.
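As an illustration, here is the bare Python equivalent of what the containerized deployment wraps: a minimal vLLM launch, assuming the Hugging Face checkpoint name below and two GPUs. The parameter values should come out of the sizing steps in the next section:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumption: substitute your model
    tensor_parallel_size=2,        # shard the weights across two GPUs
    gpu_memory_utilization=0.90,   # reserve headroom for framework overhead
    max_model_len=8192,            # cap context to the length you sized KV-cache for
)
outputs = llm.generate(["Summarize our GPU sizing pillars."],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```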
A Practical Sizing Framework
With the three pillars understood, here is the five-step framework we use when sizing GPU infrastructure for customers on Zadara:
Step 1: Clarify Workload Requirements
Before opening a hardware catalogue, answer these questions:
- Which model (name and parameter count)?
- Online inference or batch processing?
- Peak concurrent users (for online) or batch volume per hour (for batch)?
- Maximum context length (2K / 4K / 8K / 128K tokens)?
- Quality requirements (FP16 production-grade, or INT8/INT4 acceptable)?
- Data sovereignty constraints (on-premises, specific geography, air-gapped)?
The last point is increasingly relevant for enterprise customers. Zadara’s Sovereign AI Edge Cloud, deployed across hundreds of edge locations globally (on-premises or in partner datacenters), is specifically designed for organizations where data cannot leave a specific region or facility.
Step 2: Calculate Total VRAM
Total VRAM = Model Weights + KV-Cache + Framework Overhead

Model Weights: parameters (B) x format multiplier (GB). KV-Cache: depends on model architecture (check the GQA head count), context length, and concurrent users; scale proportionally for longer contexts. Framework overhead: 2 to 4 GB (CUDA runtime, vLLM, Kubernetes overhead).
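Expressed as a one-line helper (a sketch; the default 4 GB overhead is the top of the 2 to 4 GB range above):

```python
def total_vram_gb(weights_gb: float, kv_per_user_gb: float,
                  concurrent_users: int, overhead_gb: float = 4.0) -> float:
    """Total VRAM = Model Weights + KV-Cache + Framework Overhead."""
    return weights_gb + kv_per_user_gb * concurrent_users + overhead_gb

# Example: 70B INT8 weights, ~0.8 GB cache per user at 4K context, 50 users:
print(total_vram_gb(70, 0.8, 50))  # ~114 GB (Table 2's ~110 GB plus overhead)
```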
Step 3: Select GPU Configuration
Map your total VRAM requirement to a GPU class and configuration. Zadara supports a range of NVIDIA GPU tiers. The right match depends on your model size, concurrency, and performance requirements:
| GPU Class | Zadara Configuration | GPUs per Node | VRAM per Node | Recommended For |
| --- | --- | --- | --- | --- |
| Inference-Optimized | NVIDIA RTX PRO 6000 Blackwell Server Edition | 4 | 384 GB (4 x 96 GB GDDR7) | Real-time inference, sovereign edge, models up to 70B INT8 on a single GPU, up to 70B FP16 across the node |
| Datacenter Compute | HGX server node with NVIDIA H200 GPU | 8 | 1,128 GB (8 x 141 GB HBM3e) | 70B+ models, high-concurrency inference, fine-tuning, 4.8 TB/s bandwidth per GPU |
| Datacenter Compute (Next-Gen) | HGX server node with NVIDIA B300 GPU | 8 | 2,304 GB (8 x 288 GB HBM3e) | 400B+ models, frontier training, high concurrency with large KV-cache, 8 TB/s bandwidth per GPU |
| Large-Scale Cluster + Spectrum-X | Multi-node HGX + NVIDIA BlueField-3 DPU | Scalable | Scalable | Foundation model training, 400B+ inference, sovereign AI factories |
Contact the Zadara team to confirm currently available GPU configurations. The portfolio evolves continuously with NVIDIA’s roadmap.
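A hypothetical lookup that mirrors the table above can make this mapping mechanical; node names and VRAM totals are taken from the configurations listed, and anything beyond the largest node falls through to a cluster:

```python
# Usable VRAM per node class, from the configuration table above.
NODE_VRAM_GB = {
    "RTX PRO 6000 node (4 x 96 GB)": 384,
    "HGX H200 node (8 x 141 GB)": 1128,
    "HGX B300 node (8 x 288 GB)": 2304,
}

def smallest_fitting_node(required_vram_gb: float) -> str:
    """Return the smallest single node that holds the workload, else a cluster."""
    for name, vram in sorted(NODE_VRAM_GB.items(), key=lambda item: item[1]):
        if vram >= required_vram_gb:
            return name
    return "Multi-node Spectrum-X / InfiniBand cluster"

print(smallest_fitting_node(99))    # RTX PRO 6000 node (4 x 96 GB)
print(smallest_fitting_node(1500))  # HGX B300 node (8 x 288 GB)
```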
Step 4: Validate Performance
Cross-check that your configuration meets latency and throughput requirements:
Estimated tokens/s = (Combined Bandwidth GB/s) / (Model Size GB)

For online inference targets: Time to First Token (TTFT) should be under 2 seconds, and perceived generation speed should exceed 30 tokens/s per user.
If the estimate falls short, either add GPUs or consider quantization, but document the quality trade-off for the customer.
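A trivial gate for this step, assuming the single-stream estimate from Pillar 2 and a measured or projected TTFT:

```python
def meets_online_targets(bandwidth_gb_s: float, model_size_gb: float,
                         projected_ttft_s: float) -> bool:
    """Online targets: >30 tokens/s perceived speed, TTFT under 2 seconds."""
    tokens_per_s = bandwidth_gb_s / model_size_gb
    return tokens_per_s > 30 and projected_ttft_s < 2.0

print(meets_online_targets(3_584, 70, 1.2))  # True: the worked example below passes
```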
Step 5: Add Redundancy and Scale-Out
A single-node GPU deployment is not production-ready. For any customer-facing workload:
- High availability: Minimum two independent GPU nodes behind a load balancer.
- Kubernetes scheduling: NVIDIA GPU Operator on zCompute Kubernetes enables node-level GPU scheduling and health monitoring.
- Horizontal scaling: Zadara's autoscaling groups allow GPU node count to flex with demand.
Storage for model weights and datasets: Zadara's architecture provides two complementary storage tiers. Elastic block volumes are always provided by Zadara (zStorage VPSA), delivering consistent low-latency block storage to GPU nodes via iSCSI to Kubernetes persistent volumes; this is the default for model weight storage and eliminates cold-start performance degradation. For workloads that require high-throughput shared file storage (large training datasets, checkpoint management across nodes, or multi-tenant model registries), Zadara integrates either DDN EXAScaler or Zadara's own file system, depending on application requirements. The choice between these shared storage options is driven by the specific performance profile, scale, and data access patterns of the workload.
Worked Example: Sovereign AI Platform for a Financial Services Customer
To make this concrete, here is a sizing exercise representative of deployments we have architected for regulated-industry customers.
Customer profile:
- Regional financial services institution.
- Requires a private LLM for internal compliance document analysis and structured Q&A.
- Data must remain within their jurisdiction.
- 40 peak concurrent analysts.
- Context length up to 8,192 tokens (long regulatory documents).
- Response time: Time to First Token under 2 seconds.
Model selection: LLaMA-3 70B in INT8 (quality sufficient for structured document Q&A, sovereign deployment on-premises).
Step 2: VRAM Calculation
Model weights (INT8): 70 GB. KV-Cache (40 users, 8K context, GQA with 8 KV heads): approximately 25 GB, assuming a quantized KV-cache and vLLM's paged allocation keeping the effective footprint near ~0.6 GB per user, so 40 users yield ~25 GB. Framework overhead: 4 GB.
Total VRAM required: ~99 GB

Step 3: GPU Selection
Required: ~99 GB VRAM.
Option A: NVIDIA RTX PRO 6000 Blackwell Server Edition node (4 x 96 GB = 384 GB total). The 70 GB INT8 model fits on a single GPU with 26 GB headroom for KV-cache on that card. With tensor parallelism across 2 GPUs, the model is sharded at 35 GB each, leaving ~61 GB per GPU for KV-cache and framework overhead. This is more than sufficient for 40 concurrent users at 8K context.
Option B: HGX server node with NVIDIA H200 GPU (8 x 141 GB = 1,128 GB total). Recommended for future growth, higher concurrency headroom, and if the customer plans to expand to larger models or longer context windows. With tensor parallelism (tp=2), each GPU shard is 35 GB with over 100 GB headroom per GPU.
Step 4: Performance Validation (NVIDIA RTX PRO 6000 Blackwell, tp=2)
Combined bandwidth: 2 x 1,792 GB/s = 3,584 GB/s
Model sharded: 70 GB / 2 = 35 GB per card
Tokens/s estimate: 3,584 / 70 = ~51 decode steps/s (~51 tokens/s single-stream)
Per user (40 concurrent, with batching each decode step serving every active request): ~40-50 tokens/s
TTFT at this load: ~1.2s (within target)
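The same numbers, reproduced as plain arithmetic (the ~0.6 GB per-user cache is the hedged estimate from Step 2):

```python
weights_gb  = 70 * 1.0          # 70B parameters at INT8 (1 byte each)
kv_gb       = 40 * 0.6          # 40 analysts x ~0.6 GB effective per-user cache
overhead_gb = 4.0
print(weights_gb + kv_gb + overhead_gb)  # 98.0 -> the ~99 GB total above
print((2 * 1_792) / weights_gb)          # 51.2 -> ~51 tokens/s at tp=2
```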
Step 5: Full Architecture
- Compute: 2 identical NVIDIA RTX PRO 6000 Blackwell Server Edition nodes on Zadara zCompute, active/active for high availability.
- Orchestration: Kubernetes on Zadara zCompute with the NVIDIA GPU Operator.
- Serving: vLLM with PagedAttention and tensor parallelism.
- Block storage: Zadara zStorage VPSA for model weight storage (NVMe-backed, served via iSCSI to Kubernetes persistent volumes).
- Shared storage: DDN EXAScaler or Zadara file system, selected based on dataset access patterns and throughput requirements.
- Networking: Isolated Zadara VPC with private endpoint, no internet egress.
- Monitoring: Prometheus + Grafana for GPU utilization, TTFT, tokens/s, and VRAM consumption.
Result: 40 concurrent users, TTFT consistently under 1.5 seconds, full data sovereignty maintained within the customer’s jurisdiction, zero dependency on public cloud APIs.
When to Scale to Cluster Configurations with Spectrum-X Networking
Single-node GPU deployments handle the majority of enterprise inference workloads comfortably. There are, however, scenarios where multi-node GPU clusters with NVIDIA Spectrum-X networking become the right answer:
Foundation model fine-tuning: Training or fine-tuning models with 70B+ parameters requires sustained, compute-bound operations across many GPUs simultaneously. Network bandwidth between GPU nodes becomes the dominant constraint. Spectrum-X’s 51.2 Tb/s switching capacity and BlueField-3 400 GbE RoCE connectivity eliminate that bottleneck.
Very large models (400B+): Models like LLaMA 405B or proprietary large-scale architectures require hundreds of GB of VRAM and high GPU-to-GPU communication for tensor parallelism. Spectrum-X’s two-tier leaf/spine architecture supports up to 8,000 GPUs, so the network does not become a ceiling on cluster size.
High-concurrency inference at scale: When serving large numbers of concurrent users with models sharded across multiple nodes, the KV-cache access pattern generates substantial east-west traffic between GPUs. Every token generation step requires reading and updating cached key-value pairs, and when those caches are distributed across nodes, network latency and bandwidth directly affect Time to First Token and per-user throughput. Spectrum-X’s adaptive routing and RoCE congestion control keep this inter-node KV-cache traffic flowing predictably, even under heavy multi-user load, preventing the tail-latency spikes that degrade user experience at scale.
High-volume multi-tenant inference platforms: Serving hundreds of simultaneous users in completely isolated tenants requires both GPU scale and strict performance isolation. Zadara’s GPU-Net technology handles tenant isolation automatically. Each tenant’s GPU communication paths are provisioned without manual network configuration, and performance is guaranteed to remain consistent regardless of adjacent tenant activity.
For these use cases, the zCompute platform maintains the same management interface, the same VPC and networking model, and the same Kubernetes-native operations across single-node and cluster deployments. You scale up the hardware without scaling up operational complexity.
Zadara as the AI Factory Platform
Beyond individual GPU sizing decisions, Zadara’s broader value is enabling what NVIDIA calls an AI Factory: a scalable, automated infrastructure for developing, training, deploying, and maintaining AI models in a repeatable, production-grade way.
Zadara provides the software platform and orchestration layer that makes this possible: zCompute for elastic GPU compute, a tiered storage architecture combining Zadara zStorage VPSA for elastic block volumes with DDN EXAScaler or Zadara file system for high-throughput shared storage (selected based on application requirements), GPU-Net for automated Spectrum-X networking, and Sovereign AI Edge Cloud for deployments that require data to stay within a specific geography or facility. These components are integrated from day one, not assembled from separate vendor products.
Getting the Sizing Conversation Right
GPU infrastructure sizing is as much a discovery process as it is a technical exercise. The right questions to ask (workload type, model size, concurrency, context length, quality requirements, sovereignty constraints) are what separate a well-matched deployment from one that either underperforms or overinvests.
Zadara’s zCompute Cloud provides the flexibility to start right-sized and grow horizontally as workloads evolve. With GPU configurations spanning from inference-optimized NVIDIA RTX PRO 6000 Blackwell GPU nodes to Spectrum-X-networked HGX clusters (NVIDIA H200 GPU and NVIDIA B300 GPU), tiered enterprise storage tightly integrated, and sovereign deployment options across hundreds of edge locations, the infrastructure adapts to the workload rather than the other way around.
If you are planning a GPU infrastructure deployment and want to run through this sizing framework with your specific workload parameters, reach out to the Zadara team. We are happy to work through the numbers with you.
© 2026 Zadara All rights reserved. This document is provided for informational purposes only and does not constitute legal, financial, or professional advice. No part of this publication may be reproduced or distributed without prior written permission from Zadara. All product names, logos, and brands are the property of their respective owners, and any use of these trademarks is for identification purposes only and does not imply endorsement.