Troubleshooting

Updated on 17 March, 2026

Common issues and verified solutions for vLLM on AMD Instinct GPUs.


GPU and System Issues

GPU Not Visible

bash
# Check device permissions
ls -la /dev/kfd /dev/dri

# Add user to required groups
sudo usermod -aG video,render $USER
# Log out and back in for changes to take effect
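
If permissions look correct but the GPUs still don't appear, a quick sanity check (assuming the ROCm tools are installed on the host):

bash
# List GPU utilization for every detected device; an empty list means
# the driver or permission problem persists
rocm-smi

# Confirm the ROCm runtime enumerates the GPU agents
rocminfo | grep gfx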

In Docker, ensure these flags are present:

bash
--device /dev/kfd \
--device /dev/dri \
--group-add=video

Multi-GPU Hanging

bash
# Verify NUMA balancing is disabled
cat /proc/sys/kernel/numa_balancing
# Should return 0

# Disable if enabled
sudo sysctl kernel.numa_balancing=0

For persistent fix:

bash
echo "kernel.numa_balancing=0" | sudo tee /etc/sysctl.d/99-numa.conf
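
To apply the new sysctl file without rebooting:

bash
# Reload all sysctl configuration files
sudo sysctl --system

# Verify the setting took effect (should print 0)
cat /proc/sys/kernel/numa_balancing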

Memory Issues

OOM During Model Load

Reduce memory requirements:

bash
# Lower context length
--max-model-len 16384

# Enable FP8 quantization
--quantization fp8

# Reduce GPU memory allocation
--gpu-memory-utilization 0.85
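
Putting these together, a reduced-memory launch might look like the following sketch (the model name is a placeholder; tune the values to your model and workload):

bash
vllm serve <your-model> \
  --max-model-len 16384 \
  --quantization fp8 \
  --gpu-memory-utilization 0.85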

OOM During Inference

bash
# Reduce concurrent sequences
--max-num-seqs 128

# Enable KV cache offloading (GQA models only)
--kv-offloading-backend native \
--kv-offloading-size 64

AITER Issues

DeepSeek Crash on Startup

Symptom: Server crashes during startup or inference with no clear error message.

Cause: AITER_ENABLE_VSKIP defaults to true when unset, which causes crashes on MI300X/MI325X with DeepSeek models.

Fix:

bash
export AITER_ENABLE_VSKIP=0
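
When running in a container, pass the variable into the container rather than exporting it on the host (flags abbreviated here; use your full `docker run` command):

bash
docker run ... --env "AITER_ENABLE_VSKIP=0" ...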

DeepSeek Block Size Error

ValueError: Block size must be 1 for MLA models

MLA (Multi-head Latent Attention) architecture requires block size of 1:

bash
--block-size 1

MLA Accuracy Loss

Reported with certain DP/TP configurations. Workaround:

bash
export VLLM_ROCM_USE_AITER_MLA=0

MoE Performance Regression

Some AITER versions have MoE regressions:

bash
# Workaround 1: Disable AITER MOE
export VLLM_ROCM_USE_AITER_MOE=0

# Workaround 2: Use known-good image
docker pull rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103

Long Warmup Time

FP8 BMM kernel pre-compilation takes ~3 minutes on first run. Subsequent starts use cached kernels.

If unacceptable:

bash
export VLLM_USE_DEEP_GEMM=0

Model-Specific Issues

Kimi-K2.5: Architecture Not Supported

Model architectures ['KimiK25ForConditionalGeneration'] are not supported for now.

Kimi-K2.5 requires the nightly vLLM build, not the stable release:

bash
# Change from stable:
rocm/vllm:rocm6.3.4_mi325_ubuntu22.04_py3.12_vllm_0.9.0.1
# To nightly:
rocm/vllm-dev:nightly

Kimi-K2.5: Quantization Method Mismatch

Quantization method specified in the model config (compressed-tensors) does not match
the quantization method specified in the argument (fp8).

Kimi-K2.5 uses native INT4 Quantization-Aware Training (QAT), stored as compressed-tensors. Do not use --quantization fp8:

bash
# Wrong - causes error
--quantization fp8

# Correct - let vLLM auto-detect
# (simply omit the --quantization flag)

Kimi-K2.5: AITER MLA Head Count Error

Aiter MLA only supports 16 or 128 number of heads. Provided 8 number of heads.

Kimi-K2.5 has 64 attention heads. With TP=8: 64 / 8 = 8 heads per GPU (unsupported by AITER MLA).

Solution: Disable AITER and use TP=4:

bash
export VLLM_ROCM_USE_AITER=0
--tensor-parallel-size 4

Note: This means Kimi-K2.5 uses only 4 of 8 GPUs. Throughput is not directly comparable to TP=8 models.

Kimi-K2.5: MXFP4 Concerns

Initial reports suggested Kimi-K2.5 uses MXFP4 format requiring MI350+ hardware. This is incorrect.

Kimi-K2.5 uses INT4 Quantization-Aware Training (QAT) - the model was trained from scratch at INT4 precision, not post-training quantized. This is fully compatible with MI325X.

Kimi-K2.5: Working Configuration

Verified working configuration:

bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --block-size 1 \
  --mm-encoder-tp-mode data

Key flags:

  • VLLM_ROCM_USE_AITER=0 - Disables AITER (avoids MLA head count issues)
  • --tensor-parallel-size 4 - Required due to AITER constraints
  • --block-size 1 - Required for MLA backend on ROCm
  • --mm-encoder-tp-mode data - Vision encoder parallelism
  • No --quantization flag - Uses model's native INT4 compressed-tensors

DeepSeek: KV Cache Offloading Not Supported

KeyError: 'model.layers.0.self_attn.indexer.k_cache'

DeepSeek V3.2 uses MLA (Multi-head Latent Attention) which is incompatible with vLLM's KV cache offloading. The MLA architecture uses an indexer-based KV cache that the OffloadingConnector cannot handle.

Solution: Use the large HBM capacity instead (256GB per MI325X is sufficient).

DeepSeek: FP8 KV Cache Not Supported

Error: ROCMAiterMLASparseBackend doesn't support fp8 kv_cache_dtype

Do not use --kv-cache-dtype fp8 with DeepSeek models. vLLM automatically uses the correct fp8_ds_mla format.

DeepSeek: Chat Endpoint Returns Error

Add the tokenizer mode flag:

bash
--tokenizer-mode deepseek_v32

DeepSeek: Using Completions API

DeepSeek V3.2 works best with the completions API (no chat template):

bash
# Use completions endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3-0324", "prompt": "Hello", "max_tokens": 100}'

For chat format, ensure proper tokenizer configuration with --tokenizer-mode deepseek_v32.
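
With the tokenizer mode set at serve time, the chat endpoint can be exercised the same way (model name matches the completions example above):

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3-0324", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'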

GEMM Kernel Error (Large MoE Models)

RuntimeError: wrong! device_gemm with the specified compilation parameters does not support this GEMM problem

This occurs with large models (Llama-405B, Qwen3-VL-235B) when AITER GEMM kernels encounter incompatible configurations.

Fix: Add KV offloading flags, even if you don't need the extra memory:

bash
--kv-offloading-backend native \
--kv-offloading-size 64 \
--disable-hybrid-kv-cache-manager

Docker Issues

Permission Denied on /dev/kfd

bash
# Ensure video group is added
docker run ... --group-add=video ...

# Host user must be in video group
sudo usermod -aG video $USER

Container Crashes Silently

Ensure security options are set:

bash
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined

IPC Errors with Multi-GPU

Shared memory is required for multi-GPU communication:

bash
--ipc=host

Performance Questions

GPU Utilization Appears Low

Despite high throughput (e.g., 47K tok/s), GPU compute utilization may show only 5-10% on dashboards. This is expected behavior:

  1. MoE Sparse Activation - Models like Qwen3-VL activate only 5-10% of parameters per token
  2. Memory Bandwidth Bound - LLM inference is limited by HBM bandwidth, not compute
  3. Efficient Batching - vLLM batches requests efficiently, reducing GPU stalls

How to verify actual saturation:

  • Watch for KV cache warnings: "cannot store X blocks"
  • Throughput plateaus despite increased concurrency
  • Consistent latency under heavy load
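
One way to watch memory and compute utilization side by side while load-testing (output format varies by ROCm version; see `rocm-smi --help`):

bash
# Refresh GPU compute use and VRAM use every second
watch -n 1 'rocm-smi --showuse --showmemuse'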

Throughput Lower Than Expected

If throughput is significantly lower than benchmarks:

  1. Check AITER is enabled: VLLM_ROCM_USE_AITER=1
  2. Verify tensor parallelism: Most models use TP=8, Kimi-K2.5 requires TP=4 with AITER disabled
  3. Check for KV cache pressure: Add --kv-offloading-size for GQA models
  4. Ensure no CPU bottleneck: Use --ipc=host for shared memory

Quick Reference

| Error | Likely cause | Fix |
| --- | --- | --- |
| GPU not visible | Permissions | --group-add=video, user groups |
| Multi-GPU hang | NUMA balancing | sysctl kernel.numa_balancing=0 |
| OOM on load | Model too large | --quantization fp8, reduce --max-model-len |
| DeepSeek crash | VSKIP enabled | AITER_ENABLE_VSKIP=0 |
| Block size error | MLA model | --block-size 1 |
| KV offload error | MLA model | Don't use KV offloading |
| GEMM kernel error | Large MoE | Add KV offloading flags |
| Chat endpoint error | Missing tokenizer mode | --tokenizer-mode deepseek_v32 |
| Kimi architecture not supported | Stable vLLM | Use rocm/vllm-dev:nightly |
| Kimi quantization mismatch | FP8 flag | Remove --quantization fp8 |
| Kimi head count error (8 heads) | AITER + TP=8 | VLLM_ROCM_USE_AITER=0 + TP=4 |
| Low GPU utilization | Memory bound | Normal for LLM inference |
