FP8 and NVFP4 Quantization

Updated on 11 March, 2026

The NVIDIA HGX B200 GPU supports both FP8 (8-bit floating point) and NVFP4 (4-bit floating point) quantization natively in hardware. These formats reduce memory usage and increase throughput compared to BF16, with minimal quality loss.


Format Comparison

| Format     | Bits | Memory vs BF16 | NVIDIA HGX B200 Support | Use Case                                      |
|------------|------|----------------|-------------------------|-----------------------------------------------|
| BF16       | 16   | 1x (baseline)  | Yes                     | Maximum quality, largest models               |
| FP8 (E4M3) | 8    | 0.5x           | Yes                     | Standard inference quantization               |
| NVFP4      | 4    | 0.25x          | NVIDIA HGX B200 only    | Maximum throughput, NVIDIA HGX B200-exclusive |
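The memory multipliers above follow directly from bytes per parameter. A minimal sketch (the helper is illustrative; it ignores activation memory and the small per-tensor scale factors that FP8/NVFP4 checkpoints store alongside the weights):

```python
# Approximate weight-memory footprint per format.
# NVFP4 packs two 4-bit values per byte, hence 0.5 bytes/param.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Weight memory in GB for a model with num_params parameters."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

# A 30B-parameter model (e.g. Nemotron 3 Nano 30B):
for fmt in ("BF16", "FP8", "NVFP4"):
    print(f"{fmt}: ~{weight_memory_gb(30e9, fmt):.0f} GB")
```

These figures (~60, ~30, and ~15 GB) line up with the Nemotron Nano sizes in the memory budget table later in this page.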

FP8 on NVIDIA HGX B200

FP8 is the default quantization for large model inference on the NVIDIA HGX B200. Most model providers now ship official FP8 checkpoints.

Pre-Quantized FP8 Models

All five models in this cookbook use FP8:

```console
# Nemotron Nano: official FP8 checkpoint
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --trust-remote-code

# Nemotron Super 49B: official FP8 checkpoint
$ vllm serve nvidia/Llama-3_3-Nemotron-Super-49B-v1_5-FP8 \
  --tensor-parallel-size 1 --trust-remote-code

# GLM-5: official FP8 checkpoint
$ vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 --trust-remote-code

# MiniMax M2.5: native FP8 support
$ vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 --trust-remote-code

# DeepSeek V3.2: on-the-fly FP8 quantization
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 --trust-remote-code --quantization fp8 --block-size 1
```

FP8 KV Cache

Separately from model weight quantization, you can quantize the KV cache to FP8. This reduces per-request memory usage and allows more concurrent requests:

```console
$ vllm serve <model> --kv-cache-dtype fp8
```

> **Note:** FP8 KV cache is independent of model weight quantization. You can use an FP8 KV cache with a BF16 model, or skip it with an FP8 model. Not all architectures support it: MLA-based models (DeepSeek) compress the KV cache differently and may not benefit.

NVFP4: NVIDIA HGX B200-Exclusive

NVFP4 is NVIDIA's 4-bit floating point format, supported only on Blackwell GPUs (NVIDIA HGX B200, B100, GB200). It halves memory compared to FP8 and doubles throughput for memory-bandwidth-bound workloads.

Available NVFP4 Models

| Model               | NVFP4 Variant                               | Source            |
|---------------------|---------------------------------------------|-------------------|
| Nemotron 3 Nano 30B | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | Official (NVIDIA) |
| MiniMax M2.5        | lukealonso/MiniMax-M2.5-NVFP4               | Community         |
| GLM-5               | lukealonso/GLM-5-NVFP4                      | Community         |

Deploying with NVFP4

```console
# Nemotron Nano in NVFP4: fits on a single GPU
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

NVFP4 enables deployment scenarios that aren't possible with FP8:

  • Single-GPU deployment for models that would otherwise need TP=2
  • Higher context lengths within the same VRAM budget
  • More concurrent requests at the same context length

When to Use NVFP4

| Scenario                          | Recommendation                                                  |
|-----------------------------------|-----------------------------------------------------------------|
| Maximize throughput               | NVFP4 (2x throughput vs FP8 for memory-bound models)            |
| Maximize quality                  | FP8 (higher precision)                                          |
| Reduce GPU count                  | NVFP4 (half the VRAM, potentially half the GPUs)                |
| Production serving                | FP8 (better studied, official checkpoints from more providers)  |
| NVIDIA HGX B200-specific showcase | NVFP4 (demonstrates unique hardware capability)                 |

Memory Budget Calculator

For a given model and quantization format on the NVIDIA HGX B200 (179 GB VRAM per GPU):

Available VRAM per GPU = 179 GB × gpu_memory_utilization
Model memory per GPU = model_size_GB / tensor_parallel_size
KV cache per GPU = Available - Model memory
Max concurrent = KV cache per GPU / (per_token_kv_size × context_length)
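The budget above can be sketched in a few lines of Python (function names are illustrative; the per-token KV size is model-dependent and the value used below is a placeholder, not a measured number):

```python
def kv_budget_gb(model_size_gb: float, tp: int,
                 total_vram_gb: float = 179.0,
                 gpu_memory_utilization: float = 0.90) -> float:
    """VRAM left for KV cache on each GPU after loading the weight shard."""
    available = total_vram_gb * gpu_memory_utilization
    model_per_gpu = model_size_gb / tp
    return available - model_per_gpu

def max_concurrent(kv_gb: float, per_token_kv_mb: float,
                   context_length: int) -> int:
    """Rough upper bound on concurrent full-context requests."""
    per_request_gb = per_token_kv_mb * context_length / 1024
    return int(kv_gb // per_request_gb)

# FP8 Nemotron Nano (~30 GB) on a single GPU:
print(round(kv_budget_gb(30, tp=1), 1))  # ~131 GB left for KV cache
```

This reproduces the "KV Available/GPU" column in the example table below; swap in a real per-token KV size for your model to estimate concurrency.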

Example: Nemotron Nano 30B

| Format | Model Size | TP | Model/GPU | KV Available/GPU | Relative Capacity |
|--------|------------|----|-----------|------------------|-------------------|
| BF16   | ~60 GB     | 2  | ~30 GB    | ~131 GB          | 1x                |
| FP8    | ~30 GB     | 2  | ~15 GB    | ~146 GB          | 1.1x              |
| FP8    | ~30 GB     | 1  | ~30 GB    | ~131 GB          | 1x (single GPU)   |
| NVFP4  | ~15 GB     | 1  | ~15 GB    | ~146 GB          | 1.1x (single GPU) |

The biggest win from NVFP4 isn't more KV cache on the same GPU count: it's being able to drop TP entirely and run on fewer GPUs.

Benchmark: FP8 vs NVFP4 (Nemotron Nano 30B, NVIDIA HGX B200 Verified)

| Metric                 | FP8 (TP=2, 2 GPUs) | NVFP4 (TP=1, 1 GPU) |
|------------------------|--------------------|---------------------|
| Peak sustained tok/s   | 18,829             | 15,575              |
| tok/s per GPU          | 9,415              | 15,575              |
| VRAM used per GPU      | 169 GB             | 173 GB              |
| Max instances per node | 4                  | 8                   |
| Aggregate node tok/s   | ~75,000            | ~124,000            |

NVFP4 delivers 1.65x better cost efficiency than FP8 by eliminating the second GPU. On a full 8-GPU node, 8 NVFP4 instances produce ~124,000 tok/s aggregate vs ~75,000 tok/s from 4 FP8 instances. Zero failed requests across all concurrency levels for both formats.
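The aggregate and cost-efficiency figures follow from simple per-node arithmetic; a quick check using the benchmark numbers above:

```python
# Per-instance throughput and GPU footprint from the benchmark table.
fp8_toks, fp8_gpus = 18_829, 2
nvfp4_toks, nvfp4_gpus = 15_575, 1
node_gpus = 8

# Instances that fit on one 8-GPU node, and the aggregate throughput.
fp8_node = (node_gpus // fp8_gpus) * fp8_toks        # 4 instances
nvfp4_node = (node_gpus // nvfp4_gpus) * nvfp4_toks  # 8 instances

# Per-GPU cost-efficiency advantage of NVFP4 over FP8.
ratio = (nvfp4_toks / nvfp4_gpus) / (fp8_toks / fp8_gpus)
print(fp8_node, nvfp4_node, round(ratio, 2))  # 75316 124600 1.65
```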
