Troubleshooting

Updated on 17 March, 2026

Common issues and verified solutions for vLLM on AMD Instinct GPUs.


GPU and System Issues

GPU Not Visible

bash
# Check device permissions
ls -la /dev/kfd /dev/dri

# Add user to required groups
sudo usermod -aG video,render $USER
# Log out and back in for changes to take effect
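
If permissions look correct but the GPUs still don't appear, a quick sanity check (assuming the ROCm tools are installed on the host):

bash
# List GPU utilization for every detected device; an empty list means
# the driver or permission problem persists
rocm-smi

# Confirm the ROCm runtime enumerates the GPU agents
rocminfo | grep gfx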

In Docker, ensure these flags are present:

bash
--device /dev/kfd \
--device /dev/dri \
--group-add=video

Multi-GPU Hanging

bash
# Verify NUMA balancing is disabled
cat /proc/sys/kernel/numa_balancing
# Should return 0

# Disable if enabled
sudo sysctl kernel.numa_balancing=0

For persistent fix:

bash
echo "kernel.numa_balancing=0" | sudo tee /etc/sysctl.d/99-numa.conf
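
To apply the new sysctl file without rebooting:

bash
# Reload all sysctl configuration files
sudo sysctl --system

# Verify the setting took effect (should print 0)
cat /proc/sys/kernel/numa_balancing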

Memory Issues

OOM During Model Load

Reduce memory requirements:

bash
# Lower context length
--max-model-len 16384

# Enable FP8 quantization
--quantization fp8

# Reduce GPU memory allocation
--gpu-memory-utilization 0.85
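
Putting these together, a reduced-memory launch might look like the following sketch (the model name is a placeholder; tune the values to your model and workload):

bash
vllm serve <your-model> \
  --max-model-len 16384 \
  --quantization fp8 \
  --gpu-memory-utilization 0.85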

OOM During Inference

bash
# Reduce concurrent sequences
--max-num-seqs 128

# Enable KV cache offloading (GQA models only)
--kv-offloading-backend native \
--kv-offloading-size 64

AITER Issues

DeepSeek Crash on Startup

Symptom: Server crashes during startup or inference with no clear error message.

Cause: AITER_ENABLE_VSKIP defaults to true when unset, which causes crashes on MI300X/MI325X with DeepSeek models.

Fix:

bash
export AITER_ENABLE_VSKIP=0
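
When running in a container, pass the variable into the container rather than exporting it on the host (flags abbreviated here; use your full `docker run` command):

bash
docker run ... --env "AITER_ENABLE_VSKIP=0" ...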

DeepSeek Block Size Error

ValueError: Block size must be 1 for MLA models

MLA (Multi-head Latent Attention) architecture requires block size of 1:

bash
--block-size 1

MLA Accuracy Loss

Reported with certain DP/TP configurations. Workaround:

bash
export VLLM_ROCM_USE_AITER_MLA=0

MoE Performance Regression

Some AITER versions have MoE regressions:

bash
# Workaround 1: Disable AITER MOE
export VLLM_ROCM_USE_AITER_MOE=0

# Workaround 2: Use known-good image
docker pull rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103

Long Warmup Time

FP8 BMM kernel pre-compilation takes ~3 minutes on first run. Subsequent starts use cached kernels.

If unacceptable:

bash
export VLLM_USE_DEEP_GEMM=0

Model-Specific Issues

Kimi-K2.5: Architecture Not Supported

Model architectures ['KimiK25ForConditionalGeneration'] are not supported for now.

Kimi-K2.5 requires the nightly vLLM build, not the stable release:

bash
# Change from stable:
rocm/vllm:rocm6.3.4_mi325_ubuntu22.04_py3.12_vllm_0.9.0.1
# To nightly:
rocm/vllm-dev:nightly

Kimi-K2.5: Quantization Method Mismatch

Quantization method specified in the model config (compressed-tensors) does not match
the quantization method specified in the argument (fp8).

Kimi-K2.5 uses native INT4 Quantization-Aware Training (QAT), stored as compressed-tensors. Do not use --quantization fp8:

bash
# Wrong - causes error
--quantization fp8

# Correct - let vLLM auto-detect
# (simply omit the --quantization flag)

Kimi-K2.5: AITER MLA Head Count Error

Aiter MLA only supports 16 or 128 number of heads. Provided 8 number of heads.

Kimi-K2.5 has 64 attention heads. With TP=8: 64 / 8 = 8 heads per GPU (unsupported by AITER MLA).

Solution: Disable AITER and use TP=4:

bash
export VLLM_ROCM_USE_AITER=0
--tensor-parallel-size 4

Note: This means Kimi-K2.5 uses only 4 of 8 GPUs. Throughput is not directly comparable to TP=8 models.

Kimi-K2.5: MXFP4 Concerns

Initial reports suggested Kimi-K2.5 uses MXFP4 format requiring MI350+ hardware. This is incorrect.

Kimi-K2.5 uses INT4 Quantization-Aware Training (QAT) - the model was trained from scratch at INT4 precision, not post-training quantized. This is fully compatible with MI325X.

Kimi-K2.5: Working Configuration

Verified working configuration:

bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --block-size 1 \
  --mm-encoder-tp-mode data

Key flags:

  • VLLM_ROCM_USE_AITER=0 - Disables AITER (avoids MLA head count issues)
  • --tensor-parallel-size 4 - Required due to AITER constraints
  • --block-size 1 - Required for MLA backend on ROCm
  • --mm-encoder-tp-mode data - Vision encoder parallelism
  • No --quantization flag - Uses model's native INT4 compressed-tensors

DeepSeek: KV Cache Offloading Not Supported

KeyError: 'model.layers.0.self_attn.indexer.k_cache'

DeepSeek V3.2 uses MLA (Multi-head Latent Attention) which is incompatible with vLLM's KV cache offloading. The MLA architecture uses an indexer-based KV cache that the OffloadingConnector cannot handle.

Solution: Use the large HBM capacity instead (256GB per MI325X is sufficient).

DeepSeek: FP8 KV Cache Not Supported

Error: ROCMAiterMLASparseBackend doesn't support fp8 kv_cache_dtype

Do not use --kv-cache-dtype fp8 with DeepSeek models. vLLM automatically uses the correct fp8_ds_mla format.

DeepSeek: Chat Endpoint Returns Error

Add the tokenizer mode flag:

bash
--tokenizer-mode deepseek_v32

DeepSeek: Using Completions API

DeepSeek V3.2 works best with the completions API (no chat template):

bash
# Use completions endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3-0324", "prompt": "Hello", "max_tokens": 100}'

For chat format, ensure proper tokenizer configuration with --tokenizer-mode deepseek_v32.
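
With the tokenizer mode set at serve time, the chat endpoint can be exercised the same way (model name matches the completions example above):

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3-0324", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'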

GEMM Kernel Error (Large MoE Models)

RuntimeError: wrong! device_gemm with the specified compilation parameters does not support this GEMM problem

This occurs with large models (Llama-405B, Qwen3-VL-235B) when AITER GEMM kernels encounter incompatible configurations.

Fix: Add KV offloading flags, even if you don't need the extra memory:

bash
--kv-offloading-backend native \
--kv-offloading-size 64 \
--disable-hybrid-kv-cache-manager

Docker Issues

Permission Denied on /dev/kfd

bash
# Ensure video group is added
docker run ... --group-add=video ...

# Host user must be in video group
sudo usermod -aG video $USER

Container Crashes Silently

Ensure security options are set:

bash
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined

IPC Errors with Multi-GPU

Shared memory is required for multi-GPU communication:

bash
--ipc=host

Performance Questions

GPU Utilization Appears Low

Despite high throughput (e.g., 47K tok/s), GPU compute utilization may show only 5-10% on dashboards. This is expected behavior:

  1. MoE Sparse Activation - Models like Qwen3-VL activate only 5-10% of parameters per token
  2. Memory Bandwidth Bound - LLM inference is limited by HBM bandwidth, not compute
  3. Efficient Batching - vLLM batches requests efficiently, reducing GPU stalls

How to verify actual saturation:

  • Watch for KV cache warnings: "cannot store X blocks"
  • Throughput plateaus despite increased concurrency
  • Consistent latency under heavy load
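
One way to watch memory and compute utilization side by side while load-testing (output format varies by ROCm version; see `rocm-smi --help`):

bash
# Refresh GPU compute use and VRAM use every second
watch -n 1 'rocm-smi --showuse --showmemuse'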

Throughput Lower Than Expected

If throughput is significantly lower than benchmarks:

  1. Check AITER is enabled: VLLM_ROCM_USE_AITER=1
  2. Verify tensor parallelism: Most models use TP=8, Kimi-K2.5 requires TP=4 with AITER disabled
  3. Check for KV cache pressure: Add --kv-offloading-size for GQA models
  4. Ensure no CPU bottleneck: Use --ipc=host for shared memory

Quick Reference

| Error | Likely cause | Fix |
| --- | --- | --- |
| GPU not visible | Permissions | --group-add=video, user groups |
| Multi-GPU hang | NUMA balancing | sysctl kernel.numa_balancing=0 |
| OOM on load | Model too large | --quantization fp8, reduce --max-model-len |
| DeepSeek crash | VSKIP enabled | AITER_ENABLE_VSKIP=0 |
| Block size error | MLA model | --block-size 1 |
| KV offload error | MLA model | Don't use KV offloading |
| GEMM kernel error | Large MoE | Add KV offloading flags |
| Chat endpoint error | Missing tokenizer mode | --tokenizer-mode deepseek_v32 |
| Kimi architecture not supported | Stable vLLM | Use rocm/vllm-dev:nightly |
| Kimi quantization mismatch | FP8 flag | Remove --quantization fp8 |
| Kimi head count error (8 heads) | AITER + TP=8 | VLLM_ROCM_USE_AITER=0 + TP=4 |
| Low GPU utilization | Memory bound | Normal for LLM inference |
