NVIDIA HGX B200 vLLM Optimization Cookbook: Executive Summary

Updated on 12 March, 2026

Executive summary of tested configurations, benchmarks, and optimization strategies for LLM inference on NVIDIA HGX B200 GPUs with vLLM.


Overview

This cookbook provides tested configurations, comprehensive benchmarks, and optimization strategies for deploying large language models on the NVIDIA HGX B200 platform (8x B200 GPUs, 1.43 TB of HBM3e in total) using vLLM. Five frontier open-source models were benchmarked across concurrency levels from 1 to 1,024, with additional experiments covering NVIDIA Dynamo disaggregated serving, NVFP4 quantization, and goodput analysis.

Hardware

  • Platform: NVIDIA HGX B200, 8x GPUs, single node
  • VRAM: 179 GB HBM3e per GPU (1.43 TB total)
  • Interconnect: NVSwitch 5.0, NVLink 5.0 (1.8 TB/s bidirectional per GPU)
  • Stack: CUDA 13.0, Driver 580.105.08, vLLM 0.16.0, PyTorch 2.9.1

Models Benchmarked

| Model | Total Params | Active Params | Architecture | Attention | TP | Quantization |
|---|---|---|---|---|---|---|
| Nemotron Nano 30B | 30B | 3B | Hybrid Mamba-Transformer MoE | Mamba (SSM) | 2 (FP8) / 1 (NVFP4) | FP8, NVFP4 |
| Nemotron Super 49B | 49B | 49B | Dense Transformer | Standard MHA | 1 | FP8, bf16 |
| MiniMax M2.5 | 229B | 10B | MoE | Lightning Attention | 4 | FP8 |
| GLM-5 | 744B | ~40B | MoE | Differential Sparse (DSA) | 8 | FP8 |
| DeepSeek V3.2 | 685B | ~37B | MoE | Multi-Latent (MLA) | 8 | FP8 |

Benchmark Methodology

  • Input/Output tokens: 2,048 / 512 (random synthetic)
  • Concurrency levels: 1, 8, 16, 32, 64, 128, 256, 512, 1024
  • GPU memory utilization: 90%
  • Tool: vllm bench serve against OpenAI-compatible API
  • Metrics: Output throughput (tok/s), TTFT, TPOT, ITL p99, saturation point
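The sweep above can be scripted; the following sketch builds one `vllm bench serve` invocation per concurrency level. Flag names follow vLLM's random-dataset benchmark mode, but the model id and the prompts-per-slot multiplier are illustrative assumptions — verify against your installed vLLM version.

```python
# Sketch of a benchmark sweep driver for `vllm bench serve`.
CONCURRENCY_LEVELS = [1, 8, 16, 32, 64, 128, 256, 512, 1024]

def bench_command(model: str, concurrency: int) -> list[str]:
    """Build one `vllm bench serve` invocation for a given concurrency level."""
    return [
        "vllm", "bench", "serve",
        "--model", model,
        "--dataset-name", "random",            # synthetic prompts
        "--random-input-len", "2048",          # 2,048 input tokens per request
        "--random-output-len", "512",          # 512 output tokens per request
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(concurrency * 10),  # assumption: 10 requests per slot
    ]

for c in CONCURRENCY_LEVELS:
    # Print the commands; pipe to a shell or run via subprocess to execute.
    print(" ".join(bench_command("nvidia/Nemotron-Nano-30B", c)))  # hypothetical model id
```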

Peak Throughput Results

| Model | Active Params | Peak tok/s | tok/s/GPU | Saturation Point |
|---|---|---|---|---|
| Nemotron Nano FP8 | 3B | 18,829 | 9,415 | ~512 |
| Nemotron Nano NVFP4 | 3B | 15,575 | 15,575 | ~512 |
| MiniMax M2.5 | 10B | 8,838 | 2,210 | ~512 |
| DeepSeek V3.2 | 37B | 4,370 | 546 | ~512 |
| Nemotron Super 49B | 49B | 3,816 | 3,816 | ~64 |
| GLM-5 | 40B | 2,132 | 267 | ~128 |

Latency at Concurrency = 32

| Model | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|
| Nemotron Nano FP8 | 206 | 7.86 | 40.72 |
| Nemotron Nano NVFP4 | 280 | 6.77 | 38.31 |
| MiniMax M2.5 | 91 | 18.82 | 22.04 |
| Nemotron Super 49B | 172 | 11.57 | 12.09 |
| DeepSeek V3.2 | 931 | 21.34 | 22.32 |
| GLM-5 | 1,341 | 33.56 | 31.86 |

NVFP4 vs FP8 (Nemotron Nano)

| Metric | FP8 (TP=2) | NVFP4 (TP=1) |
|---|---|---|
| Peak tok/s | 18,829 | 15,575 |
| tok/s/GPU | 9,415 | 15,575 (1.65x) |
| Max instances per node | 4 | 8 |
| Aggregate node tok/s | ~75,000 | ~124,000 |

NVFP4 is the NVIDIA HGX B200's standout feature: per-GPU throughput is 1.65x higher than FP8, and 8 single-GPU NVFP4 instances per node deliver ~124,000 tok/s aggregate versus ~75,000 from 4 TP=2 FP8 instances.
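The per-node aggregates follow directly from instances-per-node times per-instance peak; a minimal check of the arithmetic:

```python
def node_aggregate(peak_tok_s: int, tp: int, gpus_per_node: int = 8) -> int:
    """Aggregate node throughput: one instance per TP group, all groups in parallel."""
    instances = gpus_per_node // tp
    return instances * peak_tok_s

fp8 = node_aggregate(18_829, tp=2)    # 4 instances -> ~75,000 tok/s
nvfp4 = node_aggregate(15_575, tp=1)  # 8 instances -> ~124,000 tok/s
print(fp8, nvfp4)
# Per-GPU cost-efficiency ratio (NVFP4 over FP8), ~1.65x:
print(round(15_575 / (18_829 / 2), 2))
```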

Goodput Analysis (Nemotron Nano FP8, SLO: TTFT < 500ms, TPOT < 50ms)

| Concurrency | Goodput (req/s) | Output tok/s |
|---|---|---|
| 32 | 6.10 | 3,331 |
| 64 | 13.29 | 8,027 |
| 128 | 6.20 | 11,343 |
| 256 | 3.65 | 15,237 |
| 512 | 2.08 | 18,535 |

Goodput-optimal concurrency (c=64) is 8x lower than throughput-optimal (c=512+). For interactive workloads with SLA requirements, c=32-64 is the sweet spot.
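Goodput here counts only requests that met both latency SLOs; a minimal sketch of that computation over per-request results (the sample records are synthetic, not the benchmark's data):

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    ttft_ms: float  # time to first token
    tpot_ms: float  # mean time per output token

def goodput(results: list[RequestResult], duration_s: float,
            ttft_slo_ms: float = 500.0, tpot_slo_ms: float = 50.0) -> float:
    """Requests per second that satisfied BOTH the TTFT and TPOT SLOs."""
    good = sum(1 for r in results
               if r.ttft_ms < ttft_slo_ms and r.tpot_ms < tpot_slo_ms)
    return good / duration_s

# Three synthetic requests: only the first meets both SLOs.
batch = [RequestResult(206, 7.9), RequestResult(620, 8.1), RequestResult(310, 55.0)]
print(goodput(batch, duration_s=1.0))  # -> 1.0
```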

Dynamo Disaggregated Serving Experiments

Tested with NVIDIA Dynamo v0.9.1 on the single-node 8-GPU setup. Disaggregation splits GPUs into dedicated prefill and decode pools.

DeepSeek V3.2 (4 Prefill + 4 Decode GPUs)

| Concurrency | Aggregated tok/s | Disaggregated tok/s | Agg TTFT | Disagg TTFT |
|---|---|---|---|---|
| 1 | 70 | 85 | 2,264 ms | 691 ms |
| 32 | 1,210 | 772 | 1,685 ms | 7,800 ms |
| 128 | 3,194 | 822 | 1,647 ms | 58,094 ms |

Finding: Disaggregation wins at low concurrency (c=1): 3.3x TTFT improvement and 21% better throughput. At high concurrency, aggregated mode dominates due to memory constraints at TP=4 per worker.

Nemotron Super 49B bf16 (Aggregated TP=8 vs Disaggregated Splits)

Aggregated TP=8 dominates all disaggregated configurations: 4,900 tok/s (c=128) vs best disaggregated at 984 tok/s. NIXL transfer overhead (~800ms) exceeds prefill cost when the model fits on 1 GPU with 80 GB spare for KV cache.
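The headroom claim can be sanity-checked with back-of-envelope memory arithmetic (2 bytes/param is the usual bf16 convention; runtime overhead and activations are ignored):

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: billions of params x bytes per param."""
    return params_billion * bytes_per_param

def kv_spare_gb(params_billion: float, bytes_per_param: float,
                tp: int = 1, vram_gb: float = 179.0) -> float:
    """Per-GPU VRAM left for KV cache after sharding weights across `tp` GPUs."""
    return vram_gb - weight_gb(params_billion, bytes_per_param) / tp

# Nemotron Super 49B in bf16 on a single 179 GB GPU:
print(kv_spare_gb(49, 2.0))  # -> 81.0 GB, i.e. the "~80 GB spare" cited above
```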

Models That Cannot Be Disaggregated

  • GLM-5: OOM at TP=4 (744 GB of FP8 weights shard to ~186 GB/GPU, exceeding the 179 GB VRAM)
  • Nemotron Nano: Mamba hybrid KV cache incompatible with NIXL
  • MiniMax M2.5: vLLM 0.16.0 crash (n_group=0 incompatibility)
  • Nemotron Super 49B FP8: FlashInfer FP8 + NIXL layout assertion failure

When Disaggregation Helps

  • High input-to-output token ratios (RAG, summarization with 10K+ input)
  • SLA-driven workloads where TTFT matters more than throughput
  • Asymmetric GPU allocation (e.g., 6P+2D)
  • Multi-node deployments (where NIXL overhead is amortized)
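As a rough planning aid, the criteria above can be folded into a heuristic; the thresholds below are illustrative assumptions, not values measured in this benchmark:

```python
def consider_disaggregation(input_tokens: int, output_tokens: int,
                            concurrency: int, multi_node: bool) -> bool:
    """Illustrative heuristic distilled from the observations above."""
    # RAG/summarization profile: long inputs dominating short outputs.
    high_input_ratio = (input_tokens >= 10_000
                        and input_tokens / max(output_tokens, 1) >= 4)
    # Single-node wins were only observed at very low concurrency (c=1).
    low_concurrency = concurrency <= 8
    return multi_node or high_input_ratio or low_concurrency

print(consider_disaggregation(12_000, 500, 4, multi_node=False))   # long-input RAG
print(consider_disaggregation(2_048, 512, 128, multi_node=False))  # throughput workload
```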

Workarounds Required

  • --no-enable-flashinfer-autotune: the sm_100 autotuner segfaults, so pre-compiled cubins are used instead (benchmarked impact: <6% variance, production-viable)
  • --connector none (aggregated mode) or bf16 checkpoints (disaggregated mode): the Mamba hybrid KV cache is incompatible with NIXL
  • --no-enable-prefix-caching: prefix caching is incompatible with the hybrid KV cache
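Putting the workarounds together, a launch helper might assemble flags like this (the model id is a placeholder, and the exact CLI surface depends on your vLLM/Dynamo versions — the flags themselves are those listed above):

```python
def serve_command(model: str, tp: int, aggregated: bool = True) -> list[str]:
    """Assemble a vllm serve invocation with the workaround flags listed above."""
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--gpu-memory-utilization", "0.9",
        "--no-enable-flashinfer-autotune",  # avoid the sm_100 autotuner segfault
        "--no-enable-prefix-caching",       # hybrid KV cache incompatibility
    ]
    if aggregated:
        cmd += ["--connector", "none"]      # skip NIXL in aggregated mode
    return cmd

print(" ".join(serve_command("nvidia/Nemotron-Nano-30B", tp=2)))  # hypothetical model id
```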

Key Findings

  1. Active parameter count predicts per-GPU throughput. 3B active (Nemotron Nano) = 9,415 tok/s/GPU; 10B (MiniMax) = 2,210; 37B (DeepSeek) = 546; 40B (GLM-5) = 267.

  2. NVFP4 is the NVIDIA HGX B200's killer feature. 1.65x cost-efficiency over FP8, with ~124,000 tok/s aggregate per node from 8 single-GPU instances.

  3. KV cache architecture determines scaling. Mamba, MLA, and Lightning Attention scale to ~512 concurrent requests. DSA (GLM-5) saturates at ~128, where its large weight footprint squeezes the VRAM available for KV cache.

  4. Dynamo disaggregation is situational on single-node. Clear wins at low concurrency and high input ratios. Aggregated mode scales better at high concurrency due to per-worker memory constraints. Multi-node is where disaggregation shines.

  5. Goodput-optimal != throughput-optimal. Operators should target c=32-64 for interactive SLAs, c=256-512 for batch processing.

  6. All 5 models scale linearly to saturation with minimal failures on NVIDIA HGX B200, demonstrating production readiness of the hardware + vLLM stack.
