The NVIDIA HGX B200 GPU supports both FP8 (8-bit floating point) and NVFP4 (4-bit floating point) quantization natively in hardware. These formats reduce memory usage and increase throughput compared to BF16, with minimal quality loss.
| Format | Bits | Memory vs BF16 | NVIDIA HGX B200 Support | Use Case |
|---|---|---|---|---|
| BF16 | 16 | 1x (baseline) | Yes | Maximum quality, largest models |
| FP8 (E4M3) | 8 | 0.5x | Yes | Standard inference quantization |
| NVFP4 | 4 | 0.25x | Yes | Maximum throughput; Blackwell-only (e.g., NVIDIA HGX B200) |
FP8 is the default quantization for large model inference on the NVIDIA HGX B200. Most model providers now ship official FP8 checkpoints.
All five models in this cookbook use FP8:
```bash
# Nemotron Nano: official FP8 checkpoint
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
    --tensor-parallel-size 2 --trust-remote-code

# Nemotron Super 49B: official FP8 checkpoint
$ vllm serve nvidia/Llama-3_3-Nemotron-Super-49B-v1_5-FP8 \
    --tensor-parallel-size 1 --trust-remote-code

# GLM-5: official FP8 checkpoint
$ vllm serve zai-org/GLM-5-FP8 \
    --tensor-parallel-size 8 --trust-remote-code

# MiniMax M2.5: native FP8 support
$ vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 4 --trust-remote-code

# DeepSeek V3.2: on-the-fly FP8 quantization
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 --trust-remote-code --quantization fp8 --block-size 1
```
Separately from model weight quantization, you can quantize the KV cache to FP8. This reduces per-request memory usage and allows more concurrent requests:
```bash
$ vllm serve <model> --kv-cache-dtype fp8
```
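To see why the FP8 KV cache roughly doubles concurrent capacity, here is the standard per-token KV sizing arithmetic as a sketch. The model configuration (32 layers, 8 KV heads, head dim 128) is illustrative, not any specific model from this cookbook:

```python
# Back-of-the-envelope KV-cache sizing (illustrative config, not a real model's).
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # K and V each store num_kv_heads * head_dim values per layer,
    # hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

bf16 = kv_bytes_per_token(32, 8, 128, dtype_bytes=2)  # default BF16 KV cache
fp8  = kv_bytes_per_token(32, 8, 128, dtype_bytes=1)  # with --kv-cache-dtype fp8
print(bf16 // 1024, fp8 // 1024)  # KiB per token: 128 vs 64
```

Halving the bytes per cached token means roughly twice as many tokens (and therefore requests) fit in the same KV budget.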
NVFP4 is NVIDIA's 4-bit floating point format, supported only on Blackwell GPUs (NVIDIA HGX B200, B100, GB200). It halves memory compared to FP8 and doubles throughput for memory-bandwidth-bound workloads.
| Model | NVFP4 Variant | Source |
|---|---|---|
| Nemotron 3 Nano 30B | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | Official (NVIDIA) |
| MiniMax M2.5 | lukealonso/MiniMax-M2.5-NVFP4 | Community |
| GLM-5 | lukealonso/GLM-5-NVFP4 | Community |
```bash
# Nemotron Nano in NVFP4: fits on a single GPU
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
```
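The weight-memory arithmetic behind the format table is straightforward: bits per parameter times parameter count. A minimal sketch for a ~30B-parameter model (ignoring the small overhead of quantization scales and metadata):

```python
# Rough weight-memory estimate for a ~30B-parameter model in each format.
# Real checkpoints carry some extra bytes for scales/metadata; this ignores them.
def weight_gb(params_billion, bits_per_param):
    return params_billion * 1e9 * bits_per_param / 8 / 1e9  # bytes -> GB

for fmt, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{fmt}: ~{weight_gb(30, bits):.0f} GB")  # 60 / 30 / 15 GB
```

These are the ~60 / ~30 / ~15 GB figures used in the capacity table below.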
NVFP4 enables deployment configurations that FP8 cannot reach, but it isn't always the right choice. Pick the format based on your priorities:
| Scenario | Recommendation |
|---|---|
| Maximize throughput | NVFP4 (2x throughput vs FP8 for memory-bound models) |
| Maximize quality | FP8 (higher precision) |
| Reduce GPU count | NVFP4 (half the VRAM, potentially half the GPUs) |
| Production serving | FP8 (better studied, official checkpoints from more providers) |
| NVIDIA HGX B200-specific showcase | NVFP4 (demonstrates unique hardware capability) |
For a given model and quantization format on the NVIDIA HGX B200 (179 GB VRAM per GPU), capacity follows from a simple memory budget (the table below assumes gpu_memory_utilization = 0.90):

```
Available VRAM per GPU = 179 GB × gpu_memory_utilization
Model memory per GPU   = model_size_GB / tensor_parallel_size
KV cache per GPU       = Available VRAM per GPU - Model memory per GPU
Max concurrent requests = KV cache per GPU / (per_token_kv_size × context_length)
```

| Format | Model Size | TP | Model/GPU | KV Available/GPU | Relative Capacity |
|---|---|---|---|---|---|
| BF16 | ~60 GB | 2 | ~30 GB | ~131 GB | 1x |
| FP8 | ~30 GB | 2 | ~15 GB | ~146 GB | 1.1x |
| FP8 | ~30 GB | 1 | ~30 GB | ~131 GB | 1x (single GPU) |
| NVFP4 | ~15 GB | 1 | ~15 GB | ~146 GB | 1.1x (single GPU) |
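The table rows above can be reproduced from the budget formula. A minimal sketch (the helper name is ours, not part of vLLM):

```python
# Sketch of the capacity arithmetic above; kv_available_per_gpu is a
# hypothetical helper, not a vLLM API.
def kv_available_per_gpu(model_size_gb, tp, vram_gb=179, gpu_memory_utilization=0.90):
    """GB left for KV cache on each GPU after sharding the weights across TP ranks."""
    available = vram_gb * gpu_memory_utilization  # VRAM vLLM is allowed to use
    model_per_gpu = model_size_gb / tp            # weights split across TP ranks
    return available - model_per_gpu

print(round(kv_available_per_gpu(60, 2)))  # BF16, TP=2  -> 131
print(round(kv_available_per_gpu(30, 2)))  # FP8,  TP=2  -> 146
print(round(kv_available_per_gpu(15, 1)))  # NVFP4, TP=1 -> 146
```

Note that NVFP4 at TP=1 leaves the same KV headroom as FP8 at TP=2 while using half the GPUs.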
The biggest win from NVFP4 isn't more KV cache on the same GPU count: it's being able to drop TP entirely and run on fewer GPUs.
| Metric | FP8 (TP=2, 2 GPUs) | NVFP4 (TP=1, 1 GPU) |
|---|---|---|
| Peak sustained tok/s | 18,829 | 15,575 |
| tok/s per GPU | 9,415 | 15,575 |
| VRAM used per GPU | 169 GB | 173 GB |
| Max instances per node | 4 | 8 |
| Aggregate node tok/s | ~75,000 | ~124,000 |
NVFP4 delivers 1.65x better cost efficiency than FP8 by eliminating the second GPU. On a full 8-GPU node, 8 NVFP4 instances produce ~124,000 tok/s aggregate vs ~75,000 tok/s from 4 FP8 instances. Zero failed requests across all concurrency levels for both formats.
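The efficiency figures quoted above follow directly from the benchmark numbers in the table:

```python
# Reproducing the cost-efficiency arithmetic from the measured throughput.
fp8_toks, fp8_gpus = 18829, 2      # FP8 instance: TP=2, 2 GPUs
nvfp4_toks, nvfp4_gpus = 15575, 1  # NVFP4 instance: TP=1, 1 GPU

fp8_per_gpu = fp8_toks / fp8_gpus        # 9414.5 tok/s per GPU
nvfp4_per_gpu = nvfp4_toks / nvfp4_gpus  # 15575 tok/s per GPU
print(round(nvfp4_per_gpu / fp8_per_gpu, 2))  # -> 1.65 (cost-efficiency ratio)

# Full 8-GPU node: 4 FP8 instances vs 8 NVFP4 instances
print(4 * fp8_toks, 8 * nvfp4_toks)  # 75316 vs 124600 tok/s aggregate
```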