Inference Cookbook for CUDA

Updated on 13 March, 2026

Deploy and optimize large language models on NVIDIA HGX B200 GPUs with vLLM. This cookbook provides tested configurations, benchmark data, and optimization strategies for serving frontier models on this hardware.


Hardware

| Specification | Value |
| --- | --- |
| GPU | 8x NVIDIA B200 (HGX B200 platform) |
| VRAM | 179 GB HBM3e per GPU (1.43 TB total) |
| Memory bandwidth | 8.0 TB/s per GPU |
| Compute (FP8) | ~4.5 PFLOPS per GPU |
| Interconnect | NVSwitch 5.0 + NVLink 5.0 (1.8 TB/s bidirectional) |
| TDP | 1000 W per GPU |
| Driver | 580.105.08 |
| CUDA | 13.0 |
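For capacity planning it helps to aggregate the per-GPU figures above across the node. A quick sanity check in shell (numbers taken from the table, not measured):

```shell
# Aggregate node-level figures from the per-GPU specs above
GPUS=8
VRAM_GB=179   # HBM3e per GPU
BW_TBS=8      # memory bandwidth per GPU, TB/s

echo "Total VRAM: $((GPUS * VRAM_GB)) GB"          # 1432 GB (~1.43 TB)
echo "Aggregate bandwidth: $((GPUS * BW_TBS)) TB/s" # 64 TB/s
```

The 1432 GB total matches the "1.43 TB total" in the table.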

Models Benchmarked

This cookbook benchmarks five models that represent the architectural diversity of frontier open-source LLMs in early 2026. Each model uses a different attention mechanism and stresses the NVIDIA HGX B200's hardware subsystems differently.

| Model | Params (Active) | Attention | TP | GPU Story |
| --- | --- | --- | --- | --- |
| Nemotron 3 Nano 30B | 30B (3B) | Mamba hybrid | 2 (FP8) / 1 (NVFP4) | Single-GPU efficiency, NVFP4 |
| MiniMax M2.5 | 229B (10B) | Lightning Attention | 4 | Mid-scale, linear attention |
| GLM-5 | 744B (40B) | DSA | 8 | Frontier open-source |
| Nemotron Super 49B | 49B (49B) | Standard MHA | 1 | Dense transformer, Dynamo disagg testing |
| DeepSeek V3.2 | 685B (37B) | MLA | 8 | Frontier MoE, compressed KV |
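As a rough starting point for fitting these models, dividing FP8 weight bytes (~1 byte per parameter, an assumption) by the TP degree from the table gives an approximate per-GPU weight footprint. This sketch deliberately ignores KV cache, activations, and CUDA graph memory, so treat it as a lower bound rather than a measured number:

```shell
# Approximate per-GPU weight footprint at FP8 (~1 byte/param), params and TP from the table.
# Format of each entry: name:total_params_B:tensor_parallel_degree
for entry in "GLM-5:744:8" "DeepSeek-V3.2:685:8" "MiniMax-M2.5:229:4" \
             "Nemotron-Nano-30B:30:2" "Nemotron-Super-49B:49:1"; do
  name=${entry%%:*}           # strip everything after the first colon
  rest=${entry#*:}
  params=${rest%%:*}          # total parameters in billions
  tp=${rest##*:}              # tensor-parallel degree
  echo "$name: ~$((params / tp)) GB/GPU of weights"
done
```

At FP8 even GLM-5's ~93 GB/GPU of weights leaves headroom within 179 GB per GPU, which is where the KV cache and `--gpu-memory-utilization` budget come in.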

Why Mostly MoE?

Four of the five models use Mixture-of-Experts (MoE) routing: this reflects the current state of open-source LLMs. Llama 3.1 405B (July 2024) remains the only large dense open-source model; every successor (Llama 4, DeepSeek V3, GLM-5, Qwen3, Kimi K2.5) uses MoE. The exception is Nemotron Super 49B, a dense NAS-optimized transformer included for its compatibility with NVIDIA Dynamo's disaggregated serving.

What varies across our five models is more interesting than the dense/MoE distinction: five different attention mechanisms (Mamba SSM, standard MHA, Lightning linear attention, Differential Sparse Attention, Multi-Latent Attention), five different KV cache profiles, and active parameter counts spanning 3B to 49B.
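The sparsity of the MoE models is easy to quantify from the table above: only a small fraction of total parameters is active for any given token (computed here from the listed totals):

```shell
# Active-parameter fraction per token, from the Params (Active) column above
awk 'BEGIN {
  printf "Nemotron 3 Nano 30B: %.1f%% active\n",  3 / 30  * 100
  printf "MiniMax M2.5:        %.1f%% active\n", 10 / 229 * 100
  printf "GLM-5:               %.1f%% active\n", 40 / 744 * 100
  printf "DeepSeek V3.2:       %.1f%% active\n", 37 / 685 * 100
}'
```

So the frontier MoE models activate roughly 4-6% of their parameters per token, while the dense Nemotron Super 49B activates 100%.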

What's Covered

  • Getting Started: Environment setup, driver verification, vLLM installation
  • Model Guides: Per-model configurations, memory requirements, deployment commands
  • Optimization: FP8/NVFP4 quantization, KV cache tuning, concurrency scaling
  • Benchmarks: Throughput, latency, goodput, and scaling curves
  • NVIDIA Dynamo: Disaggregated serving with prefill/decode separation

Quick Start

```console
# Verify GPUs
$ nvidia-smi

# Serve Nemotron Nano 30B (smallest model, fastest to start)
$ pip install vllm
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

# Test it
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 64}'
```
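The curl call returns an OpenAI-style chat completion, with the assistant text at `choices[0].message.content`. A minimal extraction sketch (the response body below is an abridged, hypothetical example; `python3` is assumed available on the client):

```shell
# Abridged, hypothetical chat-completion response for illustration only
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Hello! How can I help?"}}]}'

# Extract the assistant text from the JSON body
echo "$RESPONSE" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```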