NVIDIA HGX B200 vLLM Optimization Cookbook: Executive Summary

Updated on 12 March, 2026

Executive summary of tested configurations, benchmarks, and optimization strategies for LLM inference on NVIDIA HGX B200 GPUs with vLLM.


Overview

This cookbook provides tested configurations, comprehensive benchmarks, and optimization strategies for deploying large language models on the NVIDIA HGX B200 platform (8x B200 GPUs, 1.43 TB of HBM3e in total) using vLLM. Five frontier open-source models were benchmarked across concurrency levels from 1 to 1,024, with additional experiments covering NVIDIA Dynamo disaggregated serving, NVFP4 quantization, and goodput analysis.

Hardware

  • Platform: NVIDIA HGX B200, 8x GPUs, single node
  • VRAM: 179 GB HBM3e per GPU (1.43 TB total)
  • Interconnect: NVSwitch 5.0, NVLink 5.0 (1.8 TB/s bidirectional per GPU)
  • Stack: CUDA 13.0, Driver 580.105.08, vLLM 0.16.0, PyTorch 2.9.1

Models Benchmarked

| Model | Total Params | Active Params | Architecture | Attention | TP | Quantization |
|---|---|---|---|---|---|---|
| Nemotron Nano 30B | 30B | 3B | Hybrid Mamba-Transformer MoE | Mamba (SSM) | 2 (FP8) / 1 (NVFP4) | FP8, NVFP4 |
| Nemotron Super 49B | 49B | 49B | Dense Transformer | Standard MHA | 1 | FP8, bf16 |
| MiniMax M2.5 | 229B | 10B | MoE | Lightning Attention | 4 | FP8 |
| GLM-5 | 744B | ~40B | MoE | Differential Sparse (DSA) | 8 | FP8 |
| DeepSeek V3.2 | 685B | ~37B | MoE | Multi-Latent (MLA) | 8 | FP8 |

Benchmark Methodology

  • Input/Output tokens: 2,048 / 512 (random synthetic)
  • Concurrency levels: 1, 8, 16, 32, 64, 128, 256, 512, 1024
  • GPU memory utilization: 90%
  • Tool: vllm bench serve against OpenAI-compatible API
  • Metrics: Output throughput (tok/s), TTFT, TPOT, ITL p99, saturation point
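The sweep above can be scripted; the following sketch builds one `vllm bench serve` invocation per concurrency level. Flag names follow vLLM's random-dataset benchmark mode, but the model id and the prompts-per-slot multiplier are illustrative assumptions — verify against your installed vLLM version.

```python
# Sketch of a benchmark sweep driver for `vllm bench serve`.
CONCURRENCY_LEVELS = [1, 8, 16, 32, 64, 128, 256, 512, 1024]

def bench_command(model: str, concurrency: int) -> list[str]:
    """Build one `vllm bench serve` invocation for a given concurrency level."""
    return [
        "vllm", "bench", "serve",
        "--model", model,
        "--dataset-name", "random",            # synthetic prompts
        "--random-input-len", "2048",          # 2,048 input tokens per request
        "--random-output-len", "512",          # 512 output tokens per request
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(concurrency * 10),  # assumption: 10 requests per slot
    ]

for c in CONCURRENCY_LEVELS:
    # Print the commands; pipe to a shell or run via subprocess to execute.
    print(" ".join(bench_command("nvidia/Nemotron-Nano-30B", c)))  # hypothetical model id
```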

Peak Throughput Results

| Model | Active Params | Peak tok/s | tok/s/GPU | Saturation Point |
|---|---|---|---|---|
| Nemotron Nano FP8 | 3B | 18,829 | 9,415 | ~512 |
| Nemotron Nano NVFP4 | 3B | 15,575 | 15,575 | ~512 |
| MiniMax M2.5 | 10B | 8,838 | 2,210 | ~512 |
| DeepSeek V3.2 | 37B | 4,370 | 546 | ~512 |
| Nemotron Super 49B | 49B | 3,816 | 3,816 | ~64 |
| GLM-5 | 40B | 2,132 | 267 | ~128 |

Latency at Concurrency = 32

| Model | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|
| Nemotron Nano FP8 | 206 | 7.86 | 40.72 |
| Nemotron Nano NVFP4 | 280 | 6.77 | 38.31 |
| MiniMax M2.5 | 91 | 18.82 | 22.04 |
| Nemotron Super 49B | 172 | 11.57 | 12.09 |
| DeepSeek V3.2 | 931 | 21.34 | 22.32 |
| GLM-5 | 1,341 | 33.56 | 31.86 |

NVFP4 vs FP8 (Nemotron Nano)

| Metric | FP8 (TP=2) | NVFP4 (TP=1) |
|---|---|---|
| Peak tok/s | 18,829 | 15,575 |
| tok/s/GPU | 9,415 | 15,575 (1.65x) |
| Max instances per node | 4 | 8 |
| Aggregate node tok/s | ~75,000 | ~124,000 |

NVFP4 is the NVIDIA HGX B200's standout feature: per-GPU throughput is 1.65x higher than FP8, and 8 single-GPU NVFP4 instances per node deliver ~124,000 tok/s aggregate versus ~75,000 from 4 TP=2 FP8 instances.
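The per-node aggregates follow directly from instances-per-node times per-instance peak; a minimal check of the arithmetic:

```python
def node_aggregate(peak_tok_s: int, tp: int, gpus_per_node: int = 8) -> int:
    """Aggregate node throughput: one instance per TP group, all groups in parallel."""
    instances = gpus_per_node // tp
    return instances * peak_tok_s

fp8 = node_aggregate(18_829, tp=2)    # 4 instances -> ~75,000 tok/s
nvfp4 = node_aggregate(15_575, tp=1)  # 8 instances -> ~124,000 tok/s
print(fp8, nvfp4)
# Per-GPU cost-efficiency ratio (NVFP4 over FP8), ~1.65x:
print(round(15_575 / (18_829 / 2), 2))
```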

Goodput Analysis (Nemotron Nano FP8, SLO: TTFT < 500ms, TPOT < 50ms)

| Concurrency | Goodput (req/s) | Output tok/s |
|---|---|---|
| 32 | 6.10 | 3,331 |
| 64 | 13.29 | 8,027 |
| 128 | 6.20 | 11,343 |
| 256 | 3.65 | 15,237 |
| 512 | 2.08 | 18,535 |

Goodput-optimal concurrency (c=64) is 8x lower than throughput-optimal (c=512+). For interactive workloads with SLA requirements, c=32-64 is the sweet spot.
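Goodput here counts only requests that met both latency SLOs; a minimal sketch of that computation over per-request results (the sample records are synthetic, not the benchmark's data):

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    ttft_ms: float  # time to first token
    tpot_ms: float  # mean time per output token

def goodput(results: list[RequestResult], duration_s: float,
            ttft_slo_ms: float = 500.0, tpot_slo_ms: float = 50.0) -> float:
    """Requests per second that satisfied BOTH the TTFT and TPOT SLOs."""
    good = sum(1 for r in results
               if r.ttft_ms < ttft_slo_ms and r.tpot_ms < tpot_slo_ms)
    return good / duration_s

# Three synthetic requests: only the first meets both SLOs.
batch = [RequestResult(206, 7.9), RequestResult(620, 8.1), RequestResult(310, 55.0)]
print(goodput(batch, duration_s=1.0))  # -> 1.0
```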

Dynamo Disaggregated Serving Experiments

Tested with NVIDIA Dynamo v0.9.1 on the single-node 8-GPU setup. Disaggregation splits GPUs into dedicated prefill and decode pools.

DeepSeek V3.2 (4 Prefill + 4 Decode GPUs)

| Concurrency | Aggregated tok/s | Disaggregated tok/s | Agg TTFT | Disagg TTFT |
|---|---|---|---|---|
| 1 | 70 | 85 | 2,264 ms | 691 ms |
| 32 | 1,210 | 772 | 1,685 ms | 7,800 ms |
| 128 | 3,194 | 822 | 1,647 ms | 58,094 ms |

Finding: Disaggregation wins at low concurrency (c=1): 3.3x TTFT improvement and 21% better throughput. At high concurrency, aggregated mode dominates due to memory constraints at TP=4 per worker.

Nemotron Super 49B bf16 (Aggregated TP=8 vs Disaggregated Splits)

Aggregated TP=8 dominates all disaggregated configurations: 4,900 tok/s (c=128) vs best disaggregated at 984 tok/s. NIXL transfer overhead (~800ms) exceeds prefill cost when the model fits on 1 GPU with 80 GB spare for KV cache.
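The headroom claim can be sanity-checked with back-of-envelope memory arithmetic (2 bytes/param is the usual bf16 convention; runtime overhead and activations are ignored):

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: billions of params x bytes per param."""
    return params_billion * bytes_per_param

def kv_spare_gb(params_billion: float, bytes_per_param: float,
                tp: int = 1, vram_gb: float = 179.0) -> float:
    """Per-GPU VRAM left for KV cache after sharding weights across `tp` GPUs."""
    return vram_gb - weight_gb(params_billion, bytes_per_param) / tp

# Nemotron Super 49B in bf16 on a single 179 GB GPU:
print(kv_spare_gb(49, 2.0))  # -> 81.0 GB, i.e. the "~80 GB spare" cited above
```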

Models That Cannot Be Disaggregated

  • GLM-5: OOM at TP=4 (744 GB of FP8 weights shard to ~186 GB/GPU, exceeding the 179 GB VRAM)
  • Nemotron Nano: Mamba hybrid KV cache incompatible with NIXL
  • MiniMax M2.5: vLLM 0.16.0 crash (n_group=0 incompatibility)
  • Nemotron Super 49B FP8: FlashInfer FP8 + NIXL layout assertion failure

When Disaggregation Helps

  • High input-to-output token ratios (RAG, summarization with 10K+ input)
  • SLA-driven workloads where TTFT matters more than throughput
  • Asymmetric GPU allocation (e.g., 6P+2D)
  • Multi-node deployments (where NIXL overhead is amortized)
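As a rough planning aid, the criteria above can be folded into a heuristic; the thresholds below are illustrative assumptions, not values measured in this benchmark:

```python
def consider_disaggregation(input_tokens: int, output_tokens: int,
                            concurrency: int, multi_node: bool) -> bool:
    """Illustrative heuristic distilled from the observations above."""
    # RAG/summarization profile: long inputs dominating short outputs.
    high_input_ratio = (input_tokens >= 10_000
                        and input_tokens / max(output_tokens, 1) >= 4)
    # Single-node wins were only observed at very low concurrency (c=1).
    low_concurrency = concurrency <= 8
    return multi_node or high_input_ratio or low_concurrency

print(consider_disaggregation(12_000, 500, 4, multi_node=False))   # long-input RAG
print(consider_disaggregation(2_048, 512, 128, multi_node=False))  # throughput workload
```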

Workarounds Required

  • --no-enable-flashinfer-autotune: the sm_100 autotuner segfaults, so pre-compiled cubins are used instead (benchmarked impact: <6% variance, production-viable)
  • --connector none (aggregated mode) or bf16 checkpoints (disaggregated mode): the Mamba hybrid KV cache is incompatible with NIXL
  • --no-enable-prefix-caching: prefix caching is incompatible with the hybrid KV cache
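Putting the workarounds together, a launch helper might assemble flags like this (the model id is a placeholder, and the exact CLI surface depends on your vLLM/Dynamo versions — the flags themselves are those listed above):

```python
def serve_command(model: str, tp: int, aggregated: bool = True) -> list[str]:
    """Assemble a vllm serve invocation with the workaround flags listed above."""
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--gpu-memory-utilization", "0.9",
        "--no-enable-flashinfer-autotune",  # avoid the sm_100 autotuner segfault
        "--no-enable-prefix-caching",       # hybrid KV cache incompatibility
    ]
    if aggregated:
        cmd += ["--connector", "none"]      # skip NIXL in aggregated mode
    return cmd

print(" ".join(serve_command("nvidia/Nemotron-Nano-30B", tp=2)))  # hypothetical model id
```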

Key Findings

  1. Active parameter count predicts per-GPU throughput. 3B active (Nemotron Nano) = 9,415 tok/s/GPU; 10B (MiniMax) = 2,210; 37B (DeepSeek) = 546; 40B (GLM-5) = 267.

  2. NVFP4 is the NVIDIA HGX B200's killer feature. 1.65x cost-efficiency over FP8, with ~124,000 tok/s aggregate per node from 8 single-GPU instances.

  3. KV cache architecture determines scaling. Mamba, MLA, and Lightning Attention scale to ~512 concurrent requests. DSA (GLM-5) saturates at ~128, where its large weight footprint squeezes the VRAM available for KV cache.

  4. Dynamo disaggregation is situational on single-node. Clear wins at low concurrency and high input ratios. Aggregated mode scales better at high concurrency due to per-worker memory constraints. Multi-node is where disaggregation shines.

  5. Goodput-optimal != throughput-optimal. Operators should target c=32-64 for interactive SLAs, c=256-512 for batch processing.

  6. All 5 models scale linearly to saturation with minimal failures on NVIDIA HGX B200, demonstrating production readiness of the hardware + vLLM stack.
