FP8 Quantization

Updated on 17 March, 2026

Reduce memory usage and improve throughput with FP8 quantization on AMD Instinct GPUs.


Overview

FP8 quantization reduces model precision from 16-bit to 8-bit, providing:

  • 50% memory reduction for model weights
  • Higher throughput, plus the ability to serve larger models that wouldn't fit in BF16
  • Minimal accuracy loss for most workloads

Basic Usage

```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --quantization fp8
```

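Once the container is up, you can sanity-check the server with a standard OpenAI-compatible completion request (a minimal sketch; the `model` field must match whatever you passed to `--model`, and the server needs a moment to finish loading weights first):

```bash
# Query the OpenAI-compatible endpoint that vLLM exposes on port 8000.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MODEL_NAME",
        "prompt": "Hello, FP8!",
        "max_tokens": 16
      }'
```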
FP8 Methods

Standard FP8 (W8A8)

```bash
--quantization fp8
```

Quantizes both weights and activations to FP8.

PTPC-FP8 (Recommended for ROCm)

```bash
--quantization ptpc_fp8
```

Per-Token Per-Channel FP8:

  • Per-token scaling for activations
  • Per-channel scaling for weights
  • Better accuracy than standard FP8
  • Recommended for AMD ROCm (vLLM v0.7.3+)

FP8 KV Cache

```bash
--kv-cache-dtype fp8
```

Additionally quantizes the KV cache for further memory savings.

Warning
Do not use --kv-cache-dtype fp8 with DeepSeek models. The ROCMAiterMLASparseBackend doesn't support it.

Memory Comparison

| Configuration | Memory Usage | Notes |
|---|---|---|
| BF16 | 100% (baseline) | Maximum accuracy |
| FP8 | 50% | Good accuracy |
| FP8 + FP8 KV cache | 40-45% | Maximum savings |
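The weight-memory side of this table follows from simple arithmetic: parameter count times bytes per parameter (2 for BF16, 1 for FP8). A small helper sketch, using decimal GB (1 GB = 1e9 bytes) and ignoring KV cache and activation overhead:

```shell
# Rough weight-memory estimate: params (in billions) x bytes per parameter
# gives weight memory in decimal GB. Ignores KV cache and activations.
estimate_weights_gb() {  # usage: estimate_weights_gb <params_in_billions> <bytes_per_param>
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f", p * b }'
}

echo "Llama-3.1-405B BF16: $(estimate_weights_gb 405 2) GB"  # 810 GB
echo "Llama-3.1-405B FP8:  $(estimate_weights_gb 405 1) GB"  # 405 GB
```

This is why the 405B model fits under FP8 but not BF16 on a single 8-GPU node: 405 GB of weights leaves headroom for KV cache, while 810 GB does not.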

Model Compatibility

| Model | FP8 Quantization | FP8 KV Cache |
|---|---|---|
| DeepSeek V3.2 | Yes | No (automatically uses fp8_ds_mla) |
| Llama-3.1-405B | Yes | Yes |
| Qwen3-VL-235B | No (ViT dimensions) | No |
| Kimi-K2.5 | No (native INT4 QAT) | No |

Performance Results (MI325X)

Throughput Comparison

| Model | BF16 | FP8 | Improvement |
|---|---|---|---|
| Qwen3-VL-235B | 26,674 tok/s | N/A | - |
| Llama-3.1-405B | N/A | 6,464 tok/s | Enables large model |

FP8 enables running large dense models like Llama-405B that wouldn't fit in BF16.

Example Configurations

High-Throughput FP8 (Llama 3.1 405B)

```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```

Memory-Constrained FP8

```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 16384
```

Troubleshooting

Shape Not Divisible by 16 (Vision Models)

```
RuntimeError: mat2 shape (1152x538) must be divisible by 16
```

Root cause: The vision encoder's (ViT) MLP dimensions are incompatible with ROCm's FP8 kernels, which require matrix dimensions divisible by 16.

Fix: Use BF16 (no quantization) for Vision-Language models like Qwen3-VL:

```bash
# Remove --quantization fp8 and --kv-cache-dtype fp8
--model Qwen/Qwen3-VL-235B-A22B-Instruct
# No quantization flags
```

ROCMAiterMLASparseBackend Error (DeepSeek)

```
ValueError: ROCMAiterMLASparseBackend doesn't support fp8 kv_cache_dtype
```

Root cause: DeepSeek's MLA architecture uses a special KV cache format (fp8_ds_mla) that is incompatible with the standard --kv-cache-dtype fp8 flag.

Fix: Do not use --kv-cache-dtype fp8 with DeepSeek models. vLLM automatically selects the appropriate format.
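For example, a DeepSeek launch command should carry only the weight-quantization flag (a sketch; the model path is a placeholder for your DeepSeek checkpoint):

```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --quantization fp8
# Note: no --kv-cache-dtype fp8 -- vLLM selects fp8_ds_mla automatically
```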

Accuracy Degradation

If you notice accuracy issues:

  1. Switch to PTPC-FP8: --quantization ptpc_fp8
  2. Use BF16 for KV cache: remove --kv-cache-dtype fp8
  3. Fall back to BF16: remove --quantization flag

Known Incompatibilities

| Model | FP8 Issue | Workaround |
|---|---|---|
| Qwen3-VL-235B | ViT dimensions not divisible by 16 | Use BF16 |
| DeepSeek V3.2 | MLA backend incompatible with fp8 KV | Omit --kv-cache-dtype fp8 |
| Kimi-K2.5 | Uses native INT4 QAT (compressed-tensors) | Omit --quantization flag |
