Configure AMD's AI Tensor Engine for ROCm (AITER) to accelerate vLLM inference.
AITER provides optimized kernels for AMD Instinct GPUs, offering:
| Workload | Improvement |
|---|---|
| DeepSeek V3/R1 | 2.1x faster |
| Block-scale GEMM | 2x faster |
| Fused MoE | 3x faster |
Source: AMD documentation. Actual performance varies by workload, model size, and configuration.
```shell
export VLLM_ROCM_USE_AITER=1
```
Or in Docker:
```shell
docker run --rm \
  --env "VLLM_ROCM_USE_AITER=1" \
  ...
```
Default: AITER is OFF (0)
When VLLM_ROCM_USE_AITER=1, these components are automatically enabled:
| Flag | Purpose |
|---|---|
| VLLM_ROCM_USE_AITER_LINEAR | Quantization + GEMM |
| VLLM_ROCM_USE_AITER_MOE | Fused Mixture-of-Experts |
| VLLM_ROCM_USE_AITER_RMSNORM | Accelerated normalization |
| VLLM_ROCM_USE_AITER_MHA | Multi-Head Attention |
| VLLM_ROCM_USE_AITER_MLA | Multi-head Latent Attention |
| VLLM_ROCM_USE_AITER_FP8BMM | FP8 batched matmul |
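If one component misbehaves, the per-component flags can be overridden individually while the master switch stays on. A minimal sketch (flag names are from the table above; singling out RMSNorm is purely illustrative):

```shell
# Keep AITER on overall, but fall back to the default RMSNorm path
# while debugging (flag names from the table above).
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_RMSNORM=0

# Quick sanity check of the effective settings
env | grep '^VLLM_ROCM_USE_AITER' | sort
```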
These flags are not enabled automatically and must be set explicitly:
| Flag | Purpose |
|---|---|
| VLLM_ROCM_USE_AITER_FP4_ASM_GEMM | FP4 assembly GEMM |
| VLLM_ROCM_USE_SKINNY_GEMM | Small batch optimization |
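These experimental flags layer on top of the base AITER switch; a hedged sketch (flag names from the table above, and whether skinny GEMM actually helps depends on your batch sizes):

```shell
# Opt in to the experimental skinny-GEMM path on top of the base switch
# (benefit is workload-dependent; measure before keeping this enabled).
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_SKINNY_GEMM=1
echo "skinny_gemm=$VLLM_ROCM_USE_SKINNY_GEMM"
```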
```shell
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768
```
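A back-of-envelope calculation shows why this model needs eight GPUs (the 1-byte-per-parameter FP8 figure is an approximation, and the result ignores KV cache and activations):

```shell
# Rough weight-memory arithmetic for TP=8 (illustrative only):
# 405B params at FP8 (~1 byte each) ~= 405 GB of weights,
# split across 8 GPUs ~= 50 GB of weights per GPU.
params_b=405        # billions of parameters
bytes_per_param=1   # FP8 quantized weights
tp=8                # tensor-parallel degree
per_gpu_gb=$((params_b * bytes_per_param / tp))
echo "approx weight GB per GPU: $per_gpu_gb"
```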
```shell
export VLLM_ROCM_USE_AITER=1
export AITER_ENABLE_VSKIP=0  # CRITICAL - prevents crashes
```
Also pass --block-size 1, which is mandatory for MLA models.
```shell
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8
```
Note that the environment variables are passed with --env: variables prefixed to the docker run command only affect the Docker client, not the vLLM process inside the container.
```shell
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --block-size 1 \
  --mm-encoder-tp-mode data
```
Key differences from other models:
- VLLM_ROCM_USE_AITER=0 - AITER disabled (not enabled)
- --tensor-parallel-size 4 - uses only 4 GPUs (not 8)
- rocm/vllm-dev:nightly image

Qwen3-VL with KV-cache offloading:

```shell
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager
```
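KV-cache offloading matters because cache growth is linear in context length and batch size. A back-of-envelope sketch of the per-token cost (the layer/head numbers below are placeholders for illustration, not Qwen3-VL's actual configuration):

```shell
# bytes/token = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
layers=64       # placeholder, not the model's real layer count
kv_heads=8      # placeholder KV-head count
head_dim=128    # placeholder head dimension
dtype_bytes=2   # FP16 KV cache assumed
per_token=$((2 * layers * kv_heads * head_dim * dtype_bytes))
echo "KV-cache bytes per token: $per_token"
```

At long context lengths this adds up quickly, which is what the offloading flags above are meant to absorb.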
Or without KV offloading:

```shell
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8
```
| Model | AITER | MHA | MLA | MOE | Block Size | TP |
|---|---|---|---|---|---|---|
| Llama 3.1 (405B) | 1 | 1 | 0 | 0 | 16 | 8 |
| Qwen3-VL (235B) | 1 | 1 | 0 | 1 | 16 | 8 |
| DeepSeek V3.2 (685B) | 1 | 0 | 1 | 1 | 1 | 8 |
| Kimi-K2.5 (1T) | 0 | – | – | – | 1 | 4 |
Kimi-K2.5 requires AITER disabled (VLLM_ROCM_USE_AITER=0) due to MXFP4 hardware requirements and attention head count incompatibility; component flags are therefore not applicable.
Check the startup logs:
```shell
VLLM_ROCM_USE_AITER=1 vllm serve MODEL_NAME 2>&1 | grep -i "aiter\|attention"
```
Look for log lines indicating which attention backend and AITER kernels were selected.
```
ValueError: Block size must be 1 for MLA models
```
MLA (Multi-head Latent Attention) architecture requires a block size of 1. Always include:
```shell
--block-size 1
```
Symptom: the server crashes during inference with no clear error message.

Fix: set AITER_ENABLE_VSKIP=0:
```shell
export AITER_ENABLE_VSKIP=0
```
Root cause: AITER_ENABLE_VSKIP defaults to true when unset, which causes crashes on MI300X/MI325X with DeepSeek models.
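Because the crash stems from an unset default, a launch script can pin the variable defensively. A sketch (the variable name and the value 0 come from the workaround above; ${VAR:-default} is standard POSIX parameter expansion):

```shell
# Pin AITER_ENABLE_VSKIP to 0 unless the caller explicitly set it,
# so the unsafe "unset = true" default can never be reached.
export AITER_ENABLE_VSKIP="${AITER_ENABLE_VSKIP:-0}"
echo "AITER_ENABLE_VSKIP=$AITER_ENABLE_VSKIP"
```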
Reported with certain DP/TP configurations:
```shell
# Workaround: Disable AITER MLA
export VLLM_ROCM_USE_AITER_MLA=0
```
Some AITER versions have MoE regressions:
```shell
# Workaround 1: Disable AITER MOE
export VLLM_ROCM_USE_AITER_MOE=0

# Workaround 2: Use known-good image
docker pull rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103
```
```
RuntimeError: wrong! device_gemm with the specified compilation parameters does not support this GEMM problem
```
For MoE models like Qwen3-VL, add KV offloading flags:
```shell
--kv-offloading-backend native \
--kv-offloading-size 64 \
--disable-hybrid-kv-cache-manager
```
FP8 BMM kernel pre-compilation takes ~3 minutes on first run:
```shell
# If unacceptable, disable DeepGEMM
export VLLM_USE_DEEP_GEMM=0
```
For large tensor parallel configurations:
```shell
# Quantized all-reduce
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION="FP"

# BF16 to FP16 cast for performance
export VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16=1
```
Beneficial for TP > 4 and high concurrency.