Executive summary of tested configurations, benchmarks, and optimization strategies for LLM inference on NVIDIA HGX B200 GPUs with vLLM.
This cookbook provides tested configurations, comprehensive benchmarks, and optimization strategies for deploying large language models on the NVIDIA HGX B200 (8x NVIDIA B200 GPUs, 1.43 TB HBM3e total) using vLLM. Five frontier open-source models were benchmarked across concurrency levels 1-1024, with additional experiments covering NVIDIA Dynamo disaggregated serving, NVFP4 quantization, and goodput analysis.
| Model | Total Params | Active Params | Architecture | Attention | TP | Quantization |
|---|---|---|---|---|---|---|
| Nemotron Nano 30B | 30B | 3B | Hybrid Mamba-Transformer MoE | Mamba (SSM) | 2/1 | FP8, NVFP4 |
| Nemotron Super 49B | 49B | 49B | Dense Transformer | Standard MHA | 1 | FP8, bf16 |
| MiniMax M2.5 229B | 229B | 10B | MoE | Lightning Attention | 4 | FP8 |
| GLM-5 744B | 744B | ~40B | MoE | Differential Sparse (DSA) | 8 | FP8 |
| DeepSeek V3.2 685B | 685B | ~37B | MoE | Multi-Latent (MLA) | 8 | FP8 |
Benchmarks were run with `vllm bench serve` against an OpenAI-compatible API.

| Model | Active Params | Peak tok/s | tok/s/GPU | Saturation Point |
|---|---|---|---|---|
| Nemotron Nano FP8 | 3B | 18,829 | 9,415 | ~512 |
| Nemotron Nano NVFP4 | 3B | 15,575 | 15,575 | ~512 |
| MiniMax M2.5 | 10B | 8,838 | 2,210 | ~512 |
| DeepSeek V3.2 | 37B | 4,370 | 546 | ~512 |
| Nemotron Super 49B | 49B | 3,816 | 3,816 | ~64 |
| GLM-5 | 40B | 2,132 | 267 | ~128 |
| Model | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|
| Nemotron Nano FP8 | 206 | 7.86 | 40.72 |
| Nemotron Nano NVFP4 | 280 | 6.77 | 38.31 |
| MiniMax M2.5 | 91 | 18.82 | 22.04 |
| Nemotron Super 49B | 172 | 11.57 | 12.09 |
| DeepSeek V3.2 | 931 | 21.34 | 22.32 |
| GLM-5 | 1,341 | 33.56 | 31.86 |
| Metric | FP8 (TP=2) | NVFP4 (TP=1) |
|---|---|---|
| Peak tok/s | 18,829 | 15,575 |
| tok/s/GPU | 9,415 | 15,575 (1.65x) |
| Max instances per node | 4 | 8 |
| Aggregate node tok/s | ~75,000 | ~124,000 |
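The comparison table's derived figures (the 1.65x efficiency gain and the aggregate node throughput) follow from the peak numbers and instance counts; a minimal check using only values from the table above:

```python
# FP8 (TP=2) vs NVFP4 (TP=1) economics on one 8-GPU node,
# using the peak tok/s figures from the comparison table.
fp8_peak, fp8_tp, fp8_instances = 18_829, 2, 4        # 4 two-GPU instances per node
nvfp4_peak, nvfp4_tp, nvfp4_instances = 15_575, 1, 8  # 8 single-GPU instances per node

fp8_per_gpu = fp8_peak / fp8_tp                # ~9,415 tok/s/GPU
nvfp4_per_gpu = nvfp4_peak / nvfp4_tp          # 15,575 tok/s/GPU
efficiency_gain = nvfp4_per_gpu / fp8_per_gpu  # ~1.65x per-GPU

node_fp8 = fp8_peak * fp8_instances            # ~75,000 tok/s aggregate
node_nvfp4 = nvfp4_peak * nvfp4_instances      # ~124,600 tok/s aggregate
```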
NVFP4 is the NVIDIA HGX B200's standout feature: 1.65x better per-GPU throughput than FP8, and eight single-GPU NVFP4 instances per node deliver ~124,000 tok/s aggregate versus ~75,000 tok/s from four FP8 instances.
| Concurrency | Goodput (req/s) | Output tok/s |
|---|---|---|
| 32 | 6.10 | 3,331 |
| 64 | 13.29 | 8,027 |
| 128 | 6.20 | 11,343 |
| 256 | 3.65 | 15,237 |
| 512 | 2.08 | 18,535 |
Goodput-optimal concurrency (c=64) is 8x lower than throughput-optimal (c=512+). For interactive workloads with SLA requirements, c=32-64 is the sweet spot.
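The gap between the two operating points can be read straight off the goodput table; a short sketch that picks each optimum from the rows above:

```python
# Goodput-optimal vs throughput-optimal concurrency, from the goodput table.
# Goodput counts only SLA-compliant requests/s; output tok/s is raw throughput.
runs = [  # (concurrency, goodput req/s, output tok/s)
    (32, 6.10, 3_331),
    (64, 13.29, 8_027),
    (128, 6.20, 11_343),
    (256, 3.65, 15_237),
    (512, 2.08, 18_535),
]

goodput_optimal = max(runs, key=lambda r: r[1])[0]     # best SLA-compliant rate
throughput_optimal = max(runs, key=lambda r: r[2])[0]  # best raw tok/s
```

Note goodput peaks at c=64 and then falls even as raw throughput keeps climbing: past saturation, extra concurrency converts SLA-compliant requests into SLA-violating ones.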
Tested with NVIDIA Dynamo v0.9.1 on the single-node 8-GPU setup. Disaggregation splits GPUs into dedicated prefill and decode pools.
| Concurrency | Aggregated tok/s | Disaggregated tok/s | Agg TTFT (ms) | Disagg TTFT (ms) |
|---|---|---|---|---|
| 1 | 70 | 85 | 2,264 | 691 |
| 32 | 1,210 | 772 | 1,685 | 7,800 |
| 128 | 3,194 | 822 | 1,647 | 58,094 |
Finding: Disaggregation wins at low concurrency (c=1): 3.3x TTFT improvement and 21% better throughput. At high concurrency, aggregated mode dominates due to memory constraints at TP=4 per worker.
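The headline ratios in this finding come straight from the table; a quick sketch deriving them from the same rows:

```python
# Quantify the c=1 win and the c=128 loss for disaggregated serving,
# using the aggregated-vs-disaggregated table above.
# rows: concurrency -> (agg tok/s, disagg tok/s, agg TTFT ms, disagg TTFT ms)
rows = {
    1:   (70, 85, 2_264, 691),
    32:  (1_210, 772, 1_685, 7_800),
    128: (3_194, 822, 1_647, 58_094),
}

agg_tps, dis_tps, agg_ttft, dis_ttft = rows[1]
ttft_speedup = agg_ttft / dis_ttft   # ~3.3x faster first token at c=1
tps_gain = dis_tps / agg_tps - 1     # ~+21% throughput at c=1

agg_tps128, dis_tps128, _, _ = rows[128]
high_conc_ratio = agg_tps128 / dis_tps128  # aggregated ~3.9x faster at c=128
```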
Aggregated TP=8 dominates all disaggregated configurations: 4,900 tok/s (c=128) vs best disaggregated at 984 tok/s. NIXL transfer overhead (~800ms) exceeds prefill cost when the model fits on 1 GPU with 80 GB spare for KV cache.
Flag workarounds used in these runs:

- `--no-enable-flashinfer-autotune`: the sm_100 autotuner segfaults; falls back to pre-compiled cubins (benchmarked impact: <6% variance, production-viable)
- `--connector none` for aggregated mode, or bf16 models for disaggregated: the Mamba hybrid KV cache is incompatible with NIXL
- `--no-enable-prefix-caching`: incompatible with the hybrid KV cache

Active parameter count predicts per-GPU throughput: 3B active (Nemotron Nano) = 9,415 tok/s/GPU; 10B (MiniMax) = 2,210; 37B (DeepSeek) = 546; 40B (GLM-5) = 267.
- NVFP4 is the NVIDIA HGX B200's killer feature: 1.65x cost-efficiency over FP8, with ~124,000 tok/s aggregate per node from 8 single-GPU instances.
- KV cache architecture determines scaling: Mamba, MLA, and Lightning Attention scale to ~512 concurrent requests, while DSA (GLM-5) saturates at ~128 due to active parameter pressure on the KV cache.
- Dynamo disaggregation is situational on a single node: clear wins at low concurrency and high input ratios, but aggregated mode scales better at high concurrency due to per-worker memory constraints. Multi-node is where disaggregation shines.
- Goodput-optimal ≠ throughput-optimal: operators should target c=32-64 for interactive SLAs and c=256-512 for batch processing.
- All five models scale linearly to saturation with minimal failures on the NVIDIA HGX B200, demonstrating production readiness of the hardware + vLLM stack.