FP8 Quantization

Updated on 17 March, 2026

Reduce memory usage and improve throughput with FP8 quantization on AMD Instinct GPUs.


Overview

FP8 quantization reduces model precision from 16-bit to 8-bit, providing:

  • 50% memory reduction for model weights
  • Higher throughput, plus the ability to serve larger models that wouldn't fit in BF16
  • Minimal accuracy loss for most workloads

Basic Usage

```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --quantization fp8
```

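Once the container is up, you can sanity-check the server with a standard OpenAI-compatible completion request (a minimal sketch; the `model` field must match whatever you passed to `--model`, and the server needs a moment to finish loading weights first):

```bash
# Query the OpenAI-compatible endpoint that vLLM exposes on port 8000.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MODEL_NAME",
        "prompt": "Hello, FP8!",
        "max_tokens": 16
      }'
```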
FP8 Methods

Standard FP8 (W8A8)

```bash
--quantization fp8
```

Quantizes both weights and activations to FP8.

PTPC-FP8 (Recommended for ROCm)

```bash
--quantization ptpc_fp8
```

Per-Token Per-Channel FP8:

  • Per-token scaling for activations
  • Per-channel scaling for weights
  • Better accuracy than standard FP8
  • Recommended for AMD ROCm (vLLM v0.7.3+)

FP8 KV Cache

```bash
--kv-cache-dtype fp8
```

Additionally quantizes the KV cache for further memory savings.

Warning
Do not use --kv-cache-dtype fp8 with DeepSeek models. The ROCMAiterMLASparseBackend doesn't support it.

Memory Comparison

| Configuration | Memory Usage | Notes |
|---|---|---|
| BF16 | 100% (baseline) | Maximum accuracy |
| FP8 | 50% | Good accuracy |
| FP8 + FP8 KV cache | 40-45% | Maximum savings |
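The weight-memory side of this table follows from simple arithmetic: parameter count times bytes per parameter (2 for BF16, 1 for FP8). A small helper sketch, using decimal GB (1 GB = 1e9 bytes) and ignoring KV cache and activation overhead:

```shell
# Rough weight-memory estimate: params (in billions) x bytes per parameter
# gives weight memory in decimal GB. Ignores KV cache and activations.
estimate_weights_gb() {  # usage: estimate_weights_gb <params_in_billions> <bytes_per_param>
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f", p * b }'
}

echo "Llama-3.1-405B BF16: $(estimate_weights_gb 405 2) GB"  # 810 GB
echo "Llama-3.1-405B FP8:  $(estimate_weights_gb 405 1) GB"  # 405 GB
```

This is why the 405B model fits under FP8 but not BF16 on a single 8-GPU node: 405 GB of weights leaves headroom for KV cache, while 810 GB does not.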

Model Compatibility

| Model | FP8 Quantization | FP8 KV Cache |
|---|---|---|
| DeepSeek V3.2 | Yes | No (automatically uses fp8_ds_mla) |
| Llama-3.1-405B | Yes | Yes |
| Qwen3-VL-235B | No (ViT dimensions) | No |
| Kimi-K2.5 | No (native INT4 QAT) | No |

Performance Results (MI325X)

Throughput Comparison

| Model | BF16 | FP8 | Improvement |
|---|---|---|---|
| Qwen3-VL-235B | 26,674 tok/s | N/A | - |
| Llama-3.1-405B | N/A | 6,464 tok/s | Enables large model |

FP8 enables running large dense models like Llama-405B that wouldn't fit in BF16.

Example Configurations

High-Throughput FP8 (Llama 3.1 405B)

```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```

Memory-Constrained FP8

```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 16384
```

Troubleshooting

Shape Not Divisible by 16 (Vision Models)

```
RuntimeError: mat2 shape (1152x538) must be divisible by 16
```

Root cause: The vision encoder's (ViT) MLP dimensions are incompatible with ROCm's FP8 kernels, which require matrix dimensions divisible by 16.

Fix: Use BF16 (no quantization) for Vision-Language models like Qwen3-VL:

```bash
# Remove --quantization fp8 and --kv-cache-dtype fp8
--model Qwen/Qwen3-VL-235B-A22B-Instruct
# No quantization flags
```

ROCMAiterMLASparseBackend Error (DeepSeek)

```
ValueError: ROCMAiterMLASparseBackend doesn't support fp8 kv_cache_dtype
```

Root cause: DeepSeek's MLA architecture uses a special KV cache format (fp8_ds_mla) that is incompatible with the standard --kv-cache-dtype fp8 flag.

Fix: Do not use --kv-cache-dtype fp8 with DeepSeek models. vLLM automatically selects the appropriate format.
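For example, a DeepSeek launch command should carry only the weight-quantization flag (a sketch; the model path is a placeholder for your DeepSeek checkpoint):

```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --quantization fp8
# Note: no --kv-cache-dtype fp8 -- vLLM selects fp8_ds_mla automatically
```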

Accuracy Degradation

If you notice accuracy issues:

  1. Switch to PTPC-FP8: --quantization ptpc_fp8
  2. Use BF16 for KV cache: remove --kv-cache-dtype fp8
  3. Fall back to BF16: remove --quantization flag

Known Incompatibilities

| Model | FP8 Issue | Workaround |
|---|---|---|
| Qwen3-VL-235B | ViT dimensions not divisible by 16 | Use BF16 |
| DeepSeek V3.2 | MLA backend incompatible with fp8 KV | Omit --kv-cache-dtype fp8 |
| Kimi-K2.5 | Uses native INT4 QAT (compressed-tensors) | Omit --quantization flag |
