Deploy and optimize large language models on NVIDIA HGX B200 GPUs with vLLM. This cookbook provides tested configurations, benchmark data, and optimization strategies for serving frontier models on NVIDIA HGX B200 infrastructure.
| Specification | Value |
|---|---|
| GPU | 8x NVIDIA HGX B200 |
| VRAM | 179 GB HBM3e per GPU (1.43 TB total) |
| Memory Bandwidth | 8.0 TB/s per GPU |
| Compute (FP8) | ~4.5 PFLOPS per GPU |
| Interconnect | NVSwitch 5.0 + NVLink 5.0 (1.8 TB/s bidirectional) |
| TDP | 1000W per GPU |
| Driver | 580.105.08 |
| CUDA | 13.0 |
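You can confirm these specifications on a live node with standard `nvidia-smi` query fields (the fields below are documented `nvidia-smi` options; output will reflect your actual driver and hardware):

```shell
# Query name, total VRAM, driver version, and power limit for all GPUs
nvidia-smi --query-gpu=index,name,memory.total,driver_version,power.limit \
  --format=csv
```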
This cookbook benchmarks five models that represent the architectural diversity of frontier open-source LLMs in early 2026. Each model uses a different attention mechanism and stresses the NVIDIA HGX B200's hardware subsystems differently.
| Model | Params (Active) | Attention | TP | GPU Story |
|---|---|---|---|---|
| Nemotron 3 Nano 30B | 30B (3B) | Mamba hybrid | 2 (FP8) / 1 (NVFP4) | Single-GPU efficiency, NVFP4 |
| MiniMax M2.5 229B | 229B (10B) | Lightning Attention | 4 | Mid-scale, linear attention |
| GLM-5 744B | 744B (40B) | DSA | 8 | Frontier open-source |
| Nemotron Super 49B | 49B (49B) | Standard MHA | 1 | Dense transformer, Dynamo disagg testing |
| DeepSeek V3.2 685B | 685B (37B) | MLA | 8 | Frontier MoE, compressed KV |
Four of the five models use Mixture-of-Experts (MoE) routing, which reflects the current state of open-source LLMs. Llama 3.1 405B (July 2024) remains the only large dense open-source model; every successor (Llama 4, DeepSeek V3, GLM-5, Qwen3, Kimi K2.5) uses MoE. The exception here is Nemotron Super 49B, a dense NAS-optimized transformer included for its compatibility with NVIDIA Dynamo's disaggregated serving.
What varies across our five models is more interesting than the dense/MoE distinction: five different attention mechanisms (Mamba SSM, standard MHA, Lightning linear attention, Differential Sparse Attention, Multi-Latent Attention), five different KV cache profiles, and active parameter counts spanning 3B to 49B.
```bash
# Verify GPUs
$ nvidia-smi

# Serve Nemotron Nano 30B (smallest model, fastest to start)
$ pip install vllm
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code

# Test it
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
         "messages": [{"role": "user", "content": "Hello"}],
         "max_tokens": 64}'
```
The NVIDIA B200 is a Blackwell-architecture data center GPU designed for large-scale AI inference and training. HGX B200 servers provide eight B200 GPUs per node connected via NVSwitch 5.0.
Deploy your first model on NVIDIA HGX B200 GPUs and verify the setup end-to-end.
Set up an NVIDIA HGX B200 instance for LLM inference with vLLM.
Deploy NVIDIA's Nemotron 3 Nano on NVIDIA HGX B200 GPUs. This hybrid Mamba-Transformer model delivers high throughput with only 3B active parameters per token.
Deploy NVIDIA's Nemotron Super 49B on NVIDIA HGX B200 GPUs. This dense transformer model is based on Llama 3.3 with NAS-optimized architecture, delivering strong reasoning performance.
Deploy DeepSeek's V3.2 on NVIDIA HGX B200 GPUs. This MoE model uses Multi-Latent Attention (MLA) for compressed KV caching, delivering strong reasoning performance at 685B parameters.
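A serve invocation for this model might look like the following sketch. The Hugging Face repo id and context length here are assumptions (substitute the actual checkpoint id); the TP=8 setting comes from the model table above:

```shell
# Sketch: serve DeepSeek V3.2 across all 8 GPUs (tensor parallelism 8).
# The repo id below is illustrative, not a confirmed checkpoint name.
vllm serve deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```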
Deploy Zhipu AI's GLM-5 on NVIDIA HGX B200 GPUs. This large MoE model introduces Differential Sparse Attention for efficient inference at 744B total parameters.
Deploy MiniMax's M2.5 on NVIDIA HGX B200 GPUs. This MoE model combines Lightning Attention with traditional SoftMax attention for efficient long-context inference.
NVIDIA Dynamo is an open-source inference framework that adds disaggregated serving, intelligent routing, and tiered KV caching on top of vLLM. Version 0.9.1 supports single-node deployment with in-memory service discovery (`--store-kv mem`): no external infrastructure (etcd, NATS) is required.
Executive summary of tested configurations, benchmarks, and optimization strategies for LLM inference on NVIDIA HGX B200 GPUs with vLLM.
The KV cache is often the dominant consumer of GPU VRAM during inference. With 179 GB per NVIDIA HGX B200 GPU, efficient cache management determines how many concurrent requests you can serve and at what context length.
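As a back-of-the-envelope sizing sketch, per-token KV cache size for a standard-attention transformer is 2 (K and V) × layers × KV heads × head dim × bytes per element. The layer, head, and dimension values below are illustrative, not taken from any model in this cookbook:

```shell
# Illustrative KV cache sizing for a standard-attention transformer
layers=61 kv_heads=8 head_dim=128 dtype_bytes=2   # BF16 cache entries
bytes_per_token=$(( 2 * layers * kv_heads * head_dim * dtype_bytes ))
echo "KV cache: ${bytes_per_token} bytes/token"   # 249856 bytes, ~244 KiB
```

Multiply by context length and concurrent sequences to estimate total cache demand against the per-GPU HBM budget.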
Extend effective memory by offloading KV cache to CPU memory when GPU HBM is insufficient.
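One mechanism for this in vLLM is the `--swap-space` flag, which reserves CPU memory (GiB per GPU) that the scheduler can swap preempted sequences' KV blocks into. The value below is a starting point for illustration, not a tuned recommendation:

```shell
# Reserve 16 GiB of CPU swap space per GPU for preempted KV cache blocks
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --swap-space 16 \
  --trust-remote-code
```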
The NVIDIA HGX B200 GPU supports both FP8 (8-bit floating point) and NVFP4 (4-bit floating point) quantization natively in hardware. These formats reduce memory usage and increase throughput compared to BF16, with minimal quality loss.
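Beyond pre-quantized checkpoints, vLLM can also store the KV cache in FP8 at load time via `--kv-cache-dtype`; the model id here is simply reused from the quick start:

```shell
# Store the KV cache in FP8, roughly halving cache memory vs. BF16
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --trust-remote-code
```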
vLLM on the NVIDIA HGX B200 uses specialized CUDA kernel backends for attention and GEMM operations. Understanding these backends helps diagnose startup issues, explain warmup times, and tune performance.
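When diagnosing backend selection, vLLM lets you pin the attention backend via the `VLLM_ATTENTION_BACKEND` environment variable. Available backend names vary by vLLM version; `FLASHINFER` is one commonly available option:

```shell
# Force the FlashInfer attention backend instead of letting vLLM auto-select
VLLM_ATTENTION_BACKEND=FLASHINFER \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --trust-remote-code
```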
Maximize throughput by tuning vLLM for concurrent request loads on NVIDIA HGX B200 GPUs.
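Two of the main vLLM knobs for concurrent load are `--max-num-seqs` (concurrent sequences per scheduling step) and `--max-num-batched-tokens` (token budget per batch). The values below are a sketch to illustrate the flags, not tuned numbers for this hardware:

```shell
# Raise batching limits to favor throughput over per-request latency
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code
```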