Common issues and verified solutions for vLLM on AMD Instinct GPUs.
```bash
# Check device permissions
ls -la /dev/kfd /dev/dri

# Add user to required groups
sudo usermod -aG video,render $USER
# Log out and back in for changes to take effect
```
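After re-logging in, a quick sanity check that the group changes actually took effect (a minimal sketch; the group names match the `usermod` command above):

```bash
# Verify the current user is in the video and render groups
for g in video render; do
  if id -nG "${USER:-$(whoami)}" | tr ' ' '\n' | grep -qx "$g"; then
    echo "$g: ok"
  else
    echo "$g: missing (re-run usermod and log in again)"
  fi
done
```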
In Docker, ensure these flags are present:
```bash
--device /dev/kfd \
--device /dev/dri \
--group-add=video
```
```bash
# Verify NUMA balancing is disabled
cat /proc/sys/kernel/numa_balancing
# Should return 0

# Disable if enabled
sudo sysctl kernel.numa_balancing=0
```
For a persistent fix:

```bash
echo "kernel.numa_balancing=0" | sudo tee /etc/sysctl.d/99-numa.conf
```
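The check and fix above can be combined into one non-destructive status snippet (it only reports; the sysctl path is the one documented above):

```bash
# Report NUMA balancing state without changing anything
if [ -r /proc/sys/kernel/numa_balancing ]; then
  if [ "$(cat /proc/sys/kernel/numa_balancing)" = "0" ]; then
    echo "numa_balancing: disabled (ok for multi-GPU vLLM)"
  else
    echo "numa_balancing: enabled (run: sudo sysctl kernel.numa_balancing=0)"
  fi
else
  echo "numa_balancing: not exposed by this kernel"
fi
```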
Reduce memory requirements:
```bash
# Lower context length
--max-model-len 16384

# Enable FP8 quantization
--quantization fp8

# Reduce GPU memory allocation
--gpu-memory-utilization 0.85

# Reduce concurrent sequences
--max-num-seqs 128

# Enable KV cache offloading (GQA models only)
--kv-offloading-backend native \
--kv-offloading-size 64
```
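To see why `--max-model-len` and `--max-num-seqs` dominate, here is a back-of-envelope KV cache estimate. All model dimensions below are hypothetical (not taken from any specific checkpoint), and the formula assumes fp16 K and V stored per layer:

```bash
# Worst-case KV cache = 2 (K+V) * layers * kv_heads * head_dim
#                       * bytes/elem * max_model_len * max_num_seqs
layers=32; kv_heads=8; head_dim=128; bytes=2   # hypothetical GQA model, fp16 KV
max_model_len=16384; max_num_seqs=128
total=$((2 * layers * kv_heads * head_dim * bytes * max_model_len * max_num_seqs))
echo "worst-case KV cache: $((total / 1073741824)) GiB"
```

With these illustrative numbers the worst case is 256 GiB; halving either `--max-model-len` or `--max-num-seqs` halves it linearly, which is why those two flags are the first knobs to turn.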
Symptom: Server crashes during startup or inference with no clear error message.
Cause: AITER_ENABLE_VSKIP defaults to true when unset, which causes crashes on MI300X/MI325X with DeepSeek models.
Fix:
```bash
export AITER_ENABLE_VSKIP=0
```
`ValueError: Block size must be 1 for MLA models`

The MLA (Multi-head Latent Attention) architecture requires a block size of 1:

```bash
--block-size 1
```
Reported with certain DP/TP configurations. Workaround:
```bash
export VLLM_ROCM_USE_AITER_MLA=0
```
Some AITER versions have MoE regressions:
```bash
# Workaround 1: Disable AITER MoE
export VLLM_ROCM_USE_AITER_MOE=0

# Workaround 2: Use a known-good image
docker pull rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103
```
FP8 BMM kernel pre-compilation takes ~3 minutes on first run. Subsequent starts use cached kernels.
If unacceptable:
```bash
export VLLM_USE_DEEP_GEMM=0
```
`Model architectures ['KimiK25ForConditionalGeneration'] are not supported for now.`

Kimi-K2.5 requires the nightly vLLM build, not the stable release:

```bash
# Change from stable:
rocm/vllm:rocm6.3.4_mi325_ubuntu22.04_py3.12_vllm_0.9.0.1
# To nightly:
rocm/vllm-dev:nightly
```
`Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the argument (fp8).`

Kimi-K2.5 uses native INT4 Quantization-Aware Training (QAT), stored as compressed-tensors. Do not use `--quantization fp8`:

```bash
# Wrong - causes the error above
--quantization fp8

# Correct - let vLLM auto-detect
# (simply omit the --quantization flag)
```
`Aiter MLA only supports 16 or 128 number of heads. Provided 8 number of heads.`

Kimi-K2.5 has 64 attention heads. With TP=8, each GPU gets 64 / 8 = 8 heads, which AITER MLA does not support.

Solution: disable AITER and use TP=4 (64 / 4 = 16 heads per GPU, which is supported):

```bash
export VLLM_ROCM_USE_AITER=0
```

and pass `--tensor-parallel-size 4` on the serve command.
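The head-count arithmetic generalizes; a quick illustrative loop over TP degrees (the 64-head total is from the error analysis above, the loop itself is just a sketch):

```bash
# AITER MLA supports only 16 or 128 heads per GPU
heads=64
for tp in 8 4 2 1; do
  echo "TP=$tp -> $((heads / tp)) heads per GPU"
done
```

Only TP=4 lands on a supported per-GPU head count (16) for a 64-head model; TP=8 gives 8, which triggers the error.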
Initial reports suggested Kimi-K2.5 uses MXFP4 format requiring MI350+ hardware. This is incorrect.
Kimi-K2.5 uses INT4 Quantization-Aware Training (QAT) - the model was trained from scratch at INT4 precision, not post-training quantized. This is fully compatible with MI325X.
Verified working configuration:
```bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --block-size 1 \
  --mm-encoder-tp-mode data
```
Key flags:

- `VLLM_ROCM_USE_AITER=0` - disables AITER (avoids the MLA head-count issue)
- `--tensor-parallel-size 4` - required due to AITER constraints
- `--block-size 1` - required for the MLA backend on ROCm
- `--mm-encoder-tp-mode data` - vision encoder parallelism
- No `--quantization` flag - uses the model's native INT4 compressed-tensors

`KeyError: 'model.layers.0.self_attn.indexer.k_cache'`

DeepSeek V3.2 uses MLA (Multi-head Latent Attention), which is incompatible with vLLM's KV cache offloading: the MLA architecture uses an indexer-based KV cache that the OffloadingConnector cannot handle.
Solution: Use the large HBM capacity instead (256GB per MI325X is sufficient).
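A back-of-envelope check that the weights fit without offloading. The 256 GB per MI325X figure is from above; the ~671B parameter count and fp8 weight width are assumptions about the deployed checkpoint, not taken from this guide:

```bash
# Does an fp8 ~671B-parameter model fit in an 8x MI325X node's HBM?
params_b=671; bytes_per_param=1   # assumed fp8 checkpoint
gpus=8; hbm_per_gpu_gb=256        # MI325X HBM capacity (from above)
weights_gb=$((params_b * bytes_per_param))
hbm_gb=$((gpus * hbm_per_gpu_gb))
echo "weights ~${weights_gb} GB vs ${hbm_gb} GB HBM -> $((hbm_gb - weights_gb)) GB headroom for KV cache"
```

Under these assumptions there is well over a terabyte of headroom for the KV cache, which is why offloading is unnecessary here.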
`Error: ROCMAiterMLASparseBackend doesn't support fp8 kv_cache_dtype`

Do not use `--kv-cache-dtype fp8` with DeepSeek models. vLLM automatically uses the correct fp8_ds_mla format.
Add the tokenizer mode flag:
```bash
--tokenizer-mode deepseek_v32
```
DeepSeek V3.2 works best with the completions API (no chat template):
```bash
# Use completions endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3-0324", "prompt": "Hello", "max_tokens": 100}'
```
For chat format, ensure proper tokenizer configuration with `--tokenizer-mode deepseek_v32`.
`RuntimeError: wrong! device_gemm with the specified compilation parameters does not support this GEMM problem`

This occurs with large models (Llama-405B, Qwen3-VL-235B) when AITER GEMM kernels encounter incompatible configurations.

Fix: add the KV offloading flags, even if you don't need the extra memory:

```bash
--kv-offloading-backend native \
--kv-offloading-size 64 \
--disable-hybrid-kv-cache-manager
```
```bash
# Ensure video group is added
docker run ... --group-add=video ...

# Host user must be in video group
sudo usermod -aG video $USER
```
Ensure security options are set:
```bash
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined
```
Shared memory is required for multi-GPU communication:
```bash
--ipc=host
```
Despite high throughput (e.g., 47K tok/s), GPU compute utilization may show only 5-10% on dashboards. This is expected: LLM decode is memory-bandwidth bound, so compute units sit largely idle while HBM bandwidth is the saturated resource.
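A rough sketch of the bottleneck. Both numbers below are hypothetical; the model is that each decode step streams the active weights from HBM once:

```bash
# Required HBM bandwidth for decode, back-of-envelope
weights_gb=120        # hypothetical active weight footprint per step
steps_per_s=50        # hypothetical decode steps per second
echo "needed: ~$((weights_gb * steps_per_s / 1000)) TB/s of HBM bandwidth, with ALUs mostly idle"
```

With these illustrative numbers the memory system is pushing several TB/s while the compute units do comparatively little work per byte moved, which is why utilization dashboards read low.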
How to verify actual saturation: watch HBM bandwidth, memory usage, and power draw (for example with rocm-smi) rather than the compute-utilization percentage.
If you see `"cannot store X blocks"` warnings or throughput is significantly lower than benchmarks, check:

- `VLLM_ROCM_USE_AITER=1` is set
- `--kv-offloading-size` is set for GQA models
- `--ipc=host` for shared memory

| Error | Likely Cause | Fix |
|---|---|---|
| GPU not visible | Permissions | `--group-add=video`, user groups |
| Multi-GPU hang | NUMA balancing | `sysctl kernel.numa_balancing=0` |
| OOM on load | Model too large | `--quantization fp8`, reduce `--max-model-len` |
| DeepSeek crash | VSKIP enabled | `AITER_ENABLE_VSKIP=0` |
| Block size error | MLA model | `--block-size 1` |
| KV offload error | MLA model | Don't use KV offloading |
| GEMM kernel error | Large MoE | Add KV offloading flags |
| Chat endpoint error | Missing tokenizer | `--tokenizer-mode deepseek_v32` |
| Kimi architecture not supported | Stable vLLM | Use `rocm/vllm-dev:nightly` |
| Kimi quantization mismatch | FP8 flag | Remove `--quantization fp8` |
| Kimi head count error (8 heads) | AITER + TP=8 | `VLLM_ROCM_USE_AITER=0` + TP=4 |
| Low GPU utilization | Memory bound | Normal for LLM inference |