The NVIDIA HGX B200 GPU supports both FP8 (8-bit floating point) and NVFP4 (4-bit floating point) quantization natively in hardware. These formats reduce memory usage and increase throughput compared to BF16, with minimal quality loss.
| Format | Bits | Memory vs BF16 | NVIDIA HGX B200 Support | Use Case |
|---|---|---|---|---|
| BF16 | 16 | 1x (baseline) | Yes | Maximum quality, largest models |
| FP8 (E4M3) | 8 | 0.5x | Yes | Standard inference quantization |
| NVFP4 | 4 | 0.25x | Yes | Maximum throughput; Blackwell-only (e.g., NVIDIA HGX B200) |
FP8 is the default quantization for large model inference on the NVIDIA HGX B200. Most model providers now ship official FP8 checkpoints.
All five models in this cookbook use FP8:
```bash
# Nemotron Nano: official FP8 checkpoint
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
    --tensor-parallel-size 2 --trust-remote-code

# Nemotron Super 49B: official FP8 checkpoint
$ vllm serve nvidia/Llama-3_3-Nemotron-Super-49B-v1_5-FP8 \
    --tensor-parallel-size 1 --trust-remote-code

# GLM-5: official FP8 checkpoint
$ vllm serve zai-org/GLM-5-FP8 \
    --tensor-parallel-size 8 --trust-remote-code

# MiniMax M2.5: native FP8 support
$ vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 4 --trust-remote-code

# DeepSeek V3.2: on-the-fly FP8 quantization
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 --trust-remote-code --quantization fp8 --block-size 1
```
Separately from model weight quantization, you can quantize the KV cache to FP8. This reduces per-request memory usage and allows more concurrent requests:
```bash
$ vllm serve <model> --kv-cache-dtype fp8
```
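To see why the FP8 KV cache roughly doubles concurrent capacity, here is the standard per-token KV sizing arithmetic as a sketch. The model configuration (32 layers, 8 KV heads, head dim 128) is illustrative, not any specific model from this cookbook:

```python
# Back-of-the-envelope KV-cache sizing (illustrative config, not a real model's).
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # K and V each store num_kv_heads * head_dim values per layer,
    # hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

bf16 = kv_bytes_per_token(32, 8, 128, dtype_bytes=2)  # default BF16 KV cache
fp8  = kv_bytes_per_token(32, 8, 128, dtype_bytes=1)  # with --kv-cache-dtype fp8
print(bf16 // 1024, fp8 // 1024)  # KiB per token: 128 vs 64
```

Halving the bytes per cached token means roughly twice as many tokens (and therefore requests) fit in the same KV budget.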
NVFP4 is NVIDIA's 4-bit floating point format, supported only on Blackwell GPUs (NVIDIA HGX B200, B100, GB200). It halves memory compared to FP8 and doubles throughput for memory-bandwidth-bound workloads.
| Model | NVFP4 Variant | Source |
|---|---|---|
| Nemotron 3 Nano 30B | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 | Official (NVIDIA) |
| MiniMax M2.5 | lukealonso/MiniMax-M2.5-NVFP4 | Community |
| GLM-5 | lukealonso/GLM-5-NVFP4 | Community |
```bash
# Nemotron Nano in NVFP4: fits on a single GPU
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
```
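The weight-memory arithmetic behind the format table is straightforward: bits per parameter times parameter count. A minimal sketch for a ~30B-parameter model (ignoring the small overhead of quantization scales and metadata):

```python
# Rough weight-memory estimate for a ~30B-parameter model in each format.
# Real checkpoints carry some extra bytes for scales/metadata; this ignores them.
def weight_gb(params_billion, bits_per_param):
    return params_billion * 1e9 * bits_per_param / 8 / 1e9  # bytes -> GB

for fmt, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{fmt}: ~{weight_gb(30, bits):.0f} GB")  # 60 / 30 / 15 GB
```

These are the ~60 / ~30 / ~15 GB figures used in the capacity table below.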
NVFP4 enables deployment configurations that FP8 cannot reach, but it isn't always the right choice. Pick the format based on your priorities:
| Scenario | Recommendation |
|---|---|
| Maximize throughput | NVFP4 (2x throughput vs FP8 for memory-bound models) |
| Maximize quality | FP8 (higher precision) |
| Reduce GPU count | NVFP4 (half the VRAM, potentially half the GPUs) |
| Production serving | FP8 (better studied, official checkpoints from more providers) |
| NVIDIA HGX B200-specific showcase | NVFP4 (demonstrates unique hardware capability) |
For a given model and quantization format on the NVIDIA HGX B200 (179 GB VRAM per GPU), capacity follows from a simple memory budget (the table below assumes gpu_memory_utilization = 0.90):

```
Available VRAM per GPU = 179 GB × gpu_memory_utilization
Model memory per GPU   = model_size_GB / tensor_parallel_size
KV cache per GPU       = Available VRAM per GPU - Model memory per GPU
Max concurrent requests = KV cache per GPU / (per_token_kv_size × context_length)
```

| Format | Model Size | TP | Model/GPU | KV Available/GPU | Relative Capacity |
|---|---|---|---|---|---|
| BF16 | ~60 GB | 2 | ~30 GB | ~131 GB | 1x |
| FP8 | ~30 GB | 2 | ~15 GB | ~146 GB | 1.1x |
| FP8 | ~30 GB | 1 | ~30 GB | ~131 GB | 1x (single GPU) |
| NVFP4 | ~15 GB | 1 | ~15 GB | ~146 GB | 1.1x (single GPU) |
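The table rows above can be reproduced from the budget formula. A minimal sketch (the helper name is ours, not part of vLLM):

```python
# Sketch of the capacity arithmetic above; kv_available_per_gpu is a
# hypothetical helper, not a vLLM API.
def kv_available_per_gpu(model_size_gb, tp, vram_gb=179, gpu_memory_utilization=0.90):
    """GB left for KV cache on each GPU after sharding the weights across TP ranks."""
    available = vram_gb * gpu_memory_utilization  # VRAM vLLM is allowed to use
    model_per_gpu = model_size_gb / tp            # weights split across TP ranks
    return available - model_per_gpu

print(round(kv_available_per_gpu(60, 2)))  # BF16, TP=2  -> 131
print(round(kv_available_per_gpu(30, 2)))  # FP8,  TP=2  -> 146
print(round(kv_available_per_gpu(15, 1)))  # NVFP4, TP=1 -> 146
```

Note that NVFP4 at TP=1 leaves the same KV headroom as FP8 at TP=2 while using half the GPUs.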
The biggest win from NVFP4 isn't more KV cache on the same GPU count: it's being able to drop TP entirely and run on fewer GPUs.
| Metric | FP8 (TP=2, 2 GPUs) | NVFP4 (TP=1, 1 GPU) |
|---|---|---|
| Peak sustained tok/s | 18,829 | 15,575 |
| tok/s per GPU | 9,415 | 15,575 |
| VRAM used per GPU | 169 GB | 173 GB |
| Max instances per node | 4 | 8 |
| Aggregate node tok/s | ~75,000 | ~124,000 |
NVFP4 delivers 1.65x better cost efficiency than FP8 by eliminating the second GPU. On a full 8-GPU node, 8 NVFP4 instances produce ~124,000 tok/s aggregate vs ~75,000 tok/s from 4 FP8 instances. Zero failed requests across all concurrency levels for both formats.
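The efficiency figures quoted above follow directly from the benchmark numbers in the table:

```python
# Reproducing the cost-efficiency arithmetic from the measured throughput.
fp8_toks, fp8_gpus = 18829, 2      # FP8 instance: TP=2, 2 GPUs
nvfp4_toks, nvfp4_gpus = 15575, 1  # NVFP4 instance: TP=1, 1 GPU

fp8_per_gpu = fp8_toks / fp8_gpus        # 9414.5 tok/s per GPU
nvfp4_per_gpu = nvfp4_toks / nvfp4_gpus  # 15575 tok/s per GPU
print(round(nvfp4_per_gpu / fp8_per_gpu, 2))  # -> 1.65 (cost-efficiency ratio)

# Full 8-GPU node: 4 FP8 instances vs 8 NVFP4 instances
print(4 * fp8_toks, 8 * nvfp4_toks)  # 75316 vs 124600 tok/s aggregate
```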