Reduce memory usage and improve throughput with FP8 quantization on AMD Instinct GPUs.
FP8 quantization reduces model precision from 16-bit to 8-bit, roughly halving weight memory and improving throughput. To serve a model with FP8 quantization:
```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --quantization fp8
```
`--quantization fp8`: Quantizes both weights and activations to FP8.
`--quantization ptpc_fp8`: Per-Token Per-Channel FP8. Uses one scale per activation token and one per weight channel instead of a single per-tensor scale, which typically preserves accuracy better.
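The difference in scaling granularity can be sketched numerically. This is a simplified illustration, not vLLM's actual kernels; the only hard constant is FP8 E4M3's maximum representable value of 448:

```python
import numpy as np

# Illustrative sketch of FP8 scale selection (not vLLM's kernel code).
# FP8 E4M3 tops out at 448, so a scale maps each group's absolute max
# onto that range.
FP8_MAX = 448.0

x = np.array([[0.01, 0.02, -0.015],   # token 0: small magnitudes
              [120.0, -90.0, 60.0]])  # token 1: large magnitudes

# Per-tensor: one scale for the whole activation tensor.
per_tensor_scale = np.abs(x).max() / FP8_MAX

# Per-token (ptpc_fp8 activations): one scale per row.
per_token_scale = np.abs(x).max(axis=1, keepdims=True) / FP8_MAX

# A single tensor-wide scale sized for token 1 crushes token 0's values,
# while per-token scales keep resolution for both rows.
print(per_tensor_scale)          # ~0.268
print(per_token_scale.ravel())   # ~[4.46e-05, 2.68e-01]
```

The same idea applies per output channel on the weight side, which is why the mode is called per-token per-channel.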
`--kv-cache-dtype fp8`: Additionally quantizes the KV cache for further memory savings.
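As a rough sizing sketch, halving the KV cache dtype halves bytes per cached token. The layer/head counts below are assumed, Llama-70B-like values, not measured vLLM figures:

```python
# Back-of-the-envelope KV cache sizing (hypothetical model shape;
# the numbers are assumptions, not measured vLLM figures).
layers, kv_heads, head_dim = 80, 8, 128

def kv_bytes_per_token(dtype_bytes):
    # K and V each store layers * kv_heads * head_dim elements per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

bf16 = kv_bytes_per_token(2)  # 327,680 bytes/token
fp8 = kv_bytes_per_token(1)   # 163,840 bytes/token
print(f"BF16 KV: {bf16 / 1024:.0f} KiB/token, FP8 KV: {fp8 / 1024:.0f} KiB/token")
```

At long context lengths this saving translates directly into more concurrent sequences per GPU.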
Note: Do not use `--kv-cache-dtype fp8` with DeepSeek models; the ROCMAiterMLASparseBackend doesn't support it.
| Configuration | Memory Usage | Notes |
|---|---|---|
| BF16 | 100% (baseline) | Maximum accuracy |
| FP8 | 50% | Good accuracy |
| FP8 + FP8 KV | 40-45% | Maximum savings |
| Model | FP8 Quantization | FP8 KV Cache |
|---|---|---|
| DeepSeek V3.2 | Yes | No (auto uses fp8_ds_mla) |
| Llama-3.1-405B | Yes | Yes |
| Qwen3-VL-235B | No (ViT dimensions) | No |
| Kimi-K2.5 | No (INT4 QAT native) | No |
| Model | BF16 | FP8 | Improvement |
|---|---|---|---|
| Qwen3-VL-235B | 26,674 tok/s | N/A | - |
| Llama-3.1-405B | N/A | 6,464 tok/s | Enables large model |
FP8 enables running large dense models like Llama-405B that wouldn't fit in BF16.
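A back-of-the-envelope sketch of why: pure arithmetic over the 405B parameter count, ignoring KV cache, activations, and quantization-scale overhead:

```python
# Per-GPU weight footprint of Llama-3.1-405B at tensor parallel 8
# (illustrative arithmetic only; real serving also needs headroom for
# KV cache, activations, and runtime overhead).
params = 405e9
tp = 8

bf16_per_gpu = params * 2 / tp / 1e9  # 2 bytes/param -> ~101 GB/GPU
fp8_per_gpu = params * 1 / tp / 1e9   # 1 byte/param  -> ~51 GB/GPU
print(f"BF16: {bf16_per_gpu:.0f} GB/GPU, FP8: {fp8_per_gpu:.0f} GB/GPU")
```

Halving the weight footprint frees that difference for KV cache, which is what makes serving practical at useful context lengths.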
```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```
```bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 16384
```
```
RuntimeError: mat2 shape (1152x538) must be divisible by 16
```

Root cause: the vision encoder's (ViT) MLP dimensions are not compatible with ROCm's FP8 kernels, which require matrix dimensions divisible by 16.
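A quick check of the dimensions in the error message confirms the mismatch. This is simple arithmetic, independent of any vLLM internals:

```python
# The failing matmul from the error above: ROCm's FP8 GEMM kernels require
# matrix dimensions to be multiples of 16. 1152 passes; the ViT MLP
# dimension 538 does not.
for dim in (1152, 538):
    ok = dim % 16 == 0
    print(f"{dim} % 16 == {dim % 16} -> {'OK' if ok else 'incompatible'}")
```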
Fix: Use BF16 (no quantization) for Vision-Language models like Qwen3-VL:
```bash
# No quantization flags: remove --quantization fp8 and --kv-cache-dtype fp8
--model Qwen/Qwen3-VL-235B-A22B-Instruct
```
```
ValueError: ROCMAiterMLASparseBackend doesn't support fp8 kv_cache_dtype
```

Root cause: DeepSeek's MLA architecture uses a special KV cache format (`fp8_ds_mla`) that is incompatible with the standard `--kv-cache-dtype fp8` flag.
Fix: Do not use `--kv-cache-dtype fp8` with DeepSeek models. vLLM automatically selects the appropriate format.
If you notice accuracy issues:
- Try `--quantization ptpc_fp8` instead of `--quantization fp8`
- Remove `--kv-cache-dtype fp8` and keep weight/activation quantization only
- Fall back to BF16 by omitting the `--quantization` flag entirely

| Model | FP8 Issue | Workaround |
|---|---|---|
| Qwen3-VL-235B | ViT dimensions not divisible by 16 | Use BF16 |
| DeepSeek V3.2 | MLA backend incompatible with fp8 KV | Omit --kv-cache-dtype fp8 |
| Kimi-K2.5 | Uses native INT4 QAT (compressed-tensors) | Omit --quantization flag |