Inference Cookbook for CUDA

Updated on 13 March, 2026

Deploy and optimize large language models on NVIDIA HGX B200 GPUs with vLLM. This cookbook provides tested configurations, benchmark data, and optimization strategies for serving frontier models on this hardware.


Hardware

| Specification | Value |
| --- | --- |
| GPU | 8x NVIDIA B200 (HGX B200 platform) |
| VRAM | 179 GB HBM3e per GPU (1.43 TB total) |
| Memory bandwidth | 8.0 TB/s per GPU |
| Compute (FP8) | ~4.5 PFLOPS per GPU |
| Interconnect | NVSwitch 5.0 + NVLink 5.0 (1.8 TB/s bidirectional) |
| TDP | 1000 W per GPU |
| Driver | 580.105.08 |
| CUDA | 13.0 |
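For capacity planning it helps to aggregate the per-GPU figures above across the node. A quick sanity check in shell (numbers taken from the table, not measured):

```shell
# Aggregate node-level figures from the per-GPU specs above
GPUS=8
VRAM_GB=179   # HBM3e per GPU
BW_TBS=8      # memory bandwidth per GPU, TB/s

echo "Total VRAM: $((GPUS * VRAM_GB)) GB"          # 1432 GB (~1.43 TB)
echo "Aggregate bandwidth: $((GPUS * BW_TBS)) TB/s" # 64 TB/s
```

The 1432 GB total matches the "1.43 TB total" in the table.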

Models Benchmarked

This cookbook benchmarks five models that represent the architectural diversity of frontier open-source LLMs in early 2026. Each model uses a different attention mechanism and stresses the NVIDIA HGX B200's hardware subsystems differently.

| Model | Params (Active) | Attention | TP | GPU Story |
| --- | --- | --- | --- | --- |
| Nemotron 3 Nano 30B | 30B (3B) | Mamba hybrid | 2 (FP8) / 1 (NVFP4) | Single-GPU efficiency, NVFP4 |
| MiniMax M2.5 | 229B (10B) | Lightning Attention | 4 | Mid-scale, linear attention |
| GLM-5 | 744B (40B) | DSA | 8 | Frontier open-source |
| Nemotron Super 49B | 49B (49B) | Standard MHA | 1 | Dense transformer, Dynamo disagg testing |
| DeepSeek V3.2 | 685B (37B) | MLA | 8 | Frontier MoE, compressed KV |
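As a rough starting point for fitting these models, dividing FP8 weight bytes (~1 byte per parameter, an assumption) by the TP degree from the table gives an approximate per-GPU weight footprint. This sketch deliberately ignores KV cache, activations, and CUDA graph memory, so treat it as a lower bound rather than a measured number:

```shell
# Approximate per-GPU weight footprint at FP8 (~1 byte/param), params and TP from the table.
# Format of each entry: name:total_params_B:tensor_parallel_degree
for entry in "GLM-5:744:8" "DeepSeek-V3.2:685:8" "MiniMax-M2.5:229:4" \
             "Nemotron-Nano-30B:30:2" "Nemotron-Super-49B:49:1"; do
  name=${entry%%:*}           # strip everything after the first colon
  rest=${entry#*:}
  params=${rest%%:*}          # total parameters in billions
  tp=${rest##*:}              # tensor-parallel degree
  echo "$name: ~$((params / tp)) GB/GPU of weights"
done
```

At FP8 even GLM-5's ~93 GB/GPU of weights leaves headroom within 179 GB per GPU, which is where the KV cache and `--gpu-memory-utilization` budget come in.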

Why Mostly MoE?

Four of the five models use Mixture-of-Experts (MoE) routing: this reflects the current state of open-source LLMs. Llama 3.1 405B (July 2024) remains the only large dense open-source model; every successor (Llama 4, DeepSeek V3, GLM-5, Qwen3, Kimi K2.5) uses MoE. The exception is Nemotron Super 49B, a dense NAS-optimized transformer included for its compatibility with NVIDIA Dynamo's disaggregated serving.

What varies across our five models is more interesting than the dense/MoE distinction: five different attention mechanisms (Mamba SSM, standard MHA, Lightning linear attention, Differential Sparse Attention, Multi-Latent Attention), five different KV cache profiles, and active parameter counts spanning 3B to 49B.
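The sparsity of the MoE models is easy to quantify from the table above: only a small fraction of total parameters is active for any given token (computed here from the listed totals):

```shell
# Active-parameter fraction per token, from the Params (Active) column above
awk 'BEGIN {
  printf "Nemotron 3 Nano 30B: %.1f%% active\n",  3 / 30  * 100
  printf "MiniMax M2.5:        %.1f%% active\n", 10 / 229 * 100
  printf "GLM-5:               %.1f%% active\n", 40 / 744 * 100
  printf "DeepSeek V3.2:       %.1f%% active\n", 37 / 685 * 100
}'
```

So the frontier MoE models activate roughly 4-6% of their parameters per token, while the dense Nemotron Super 49B activates 100%.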

What's Covered

  • Getting Started: Environment setup, driver verification, vLLM installation
  • Model Guides: Per-model configurations, memory requirements, deployment commands
  • Optimization: FP8/NVFP4 quantization, KV cache tuning, concurrency scaling
  • Benchmarks: Throughput, latency, goodput, and scaling curves
  • NVIDIA Dynamo: Disaggregated serving with prefill/decode separation

Quick Start

```console
# Verify GPUs
$ nvidia-smi

# Serve Nemotron Nano 30B (smallest model, fastest to start)
$ pip install vllm
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

# Test it
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 64}'
```
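The curl call returns an OpenAI-style chat completion, with the assistant text at `choices[0].message.content`. A minimal extraction sketch (the response body below is an abridged, hypothetical example; `python3` is assumed available on the client):

```shell
# Abridged, hypothetical chat-completion response for illustration only
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Hello! How can I help?"}}]}'

# Extract the assistant text from the JSON body
echo "$RESPONSE" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```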