AITER Configuration

Updated on 11 March 2026

Configure AMD's AI Tensor Engine for ROCm (AITER) to accelerate vLLM inference.


Overview

AITER provides optimized kernels for AMD Instinct GPUs, offering:

| Workload | Improvement |
|---|---|
| DeepSeek V3/R1 | 2.1x faster |
| Block-scale GEMM | 2x faster |
| Fused MoE | 3x faster |

Source: AMD documentation. Actual performance varies by workload, model size, and configuration.

Enable AITER

bash
export VLLM_ROCM_USE_AITER=1

Or in Docker:

bash
docker run --rm \
  --env "VLLM_ROCM_USE_AITER=1" \
  ...

Default: AITER is off (VLLM_ROCM_USE_AITER=0).
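
For a one-off run, the variable can also be set inline for a single vllm serve invocation instead of exported; a minimal sketch (the model name here is only an example):

bash
# Enable AITER for this invocation only; the shell environment is unchanged
VLLM_ROCM_USE_AITER=1 vllm serve meta-llama/Llama-3.1-8B-Instruct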

Component Flags

When VLLM_ROCM_USE_AITER=1, these components are automatically enabled:

| Flag | Purpose |
|---|---|
| VLLM_ROCM_USE_AITER_LINEAR | Quantization + GEMM |
| VLLM_ROCM_USE_AITER_MOE | Fused Mixture-of-Experts |
| VLLM_ROCM_USE_AITER_RMSNORM | Accelerated normalization |
| VLLM_ROCM_USE_AITER_MHA | Multi-Head Attention |
| VLLM_ROCM_USE_AITER_MLA | Multi-head Latent Attention |
| VLLM_ROCM_USE_AITER_FP8BMM | FP8 batched matmul |

Optional Flags (Default OFF)

| Flag | Purpose |
|---|---|
| VLLM_ROCM_USE_AITER_FP4_ASM_GEMM | FP4 assembly GEMM |
| VLLM_ROCM_USE_SKINNY_GEMM | Small-batch optimization |
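
Component flags can be overridden individually while the master switch stays on, which is how the workarounds in the Troubleshooting section below work. A minimal sketch that keeps AITER enabled but falls back to the non-AITER MoE path:

bash
export VLLM_ROCM_USE_AITER=1        # master switch: enables the components above
export VLLM_ROCM_USE_AITER_MOE=0    # selectively opt out of the fused-MoE kernels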

Model-Specific Configuration

Llama 3.1 (405B)

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768
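
Once the server is up, a quick way to confirm it is serving requests is to hit the OpenAI-compatible endpoint; a minimal smoke test (prompt and token count are arbitrary):

bash
# List the models the server has loaded
curl http://localhost:8000/v1/models

# Send a short completion request
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-405B-Instruct", "prompt": "Hello", "max_tokens": 16}'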

DeepSeek Models (MLA Architecture)

Warning
DeepSeek's MLA architecture requires the following configuration:

bash
export VLLM_ROCM_USE_AITER=1
export AITER_ENABLE_VSKIP=0  # CRITICAL - prevents crashes

Also pass --block-size 1 (mandatory for MLA).

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8
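
Because AITER_ENABLE_VSKIP defaults to true when unset (see Troubleshooting below), it is easy to forget in wrapper scripts. A defensive sketch that refuses to launch unless the variable is explicitly disabled:

bash
# Abort early rather than crash mid-inference (see "AITER VSKIP Crashes" below)
if [ "${AITER_ENABLE_VSKIP:-}" != "0" ]; then
  echo "ERROR: set AITER_ENABLE_VSKIP=0 before serving DeepSeek models" >&2
  exit 1
fi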

Kimi-K2.5 (1T MoE)

Warning
Kimi-K2.5's MLA architecture has 64 attention heads, and AITER MLA supports only 16 or 128 heads per GPU. Even with TP=4 (which yields a nominally supported 16 heads per GPU), AITER has compatibility issues with this model.
bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --block-size 1 \
  --mm-encoder-tp-mode data

Key differences from other models:

  • VLLM_ROCM_USE_AITER=0 - AITER is disabled, not enabled
  • --tensor-parallel-size 4 - uses only 4 GPUs, not 8 (see the device-pinning sketch below)
  • Requires the rocm/vllm-dev:nightly image
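
On an 8-GPU node it can help to pin the container to a specific set of four devices so the remaining GPUs stay free for other work; a sketch using the standard ROCm HIP_VISIBLE_DEVICES variable (the device indices are arbitrary):

bash
# Expose the devices but let HIP use only GPUs 0-3 inside the container
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "HIP_VISIBLE_DEVICES=0,1,2,3" \
  ...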

Qwen3-VL (235B)

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager

Vision-Language Models

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8

Configuration Matrix (Verified on MI325X)

| Model | AITER | MHA | MLA | MOE | Block Size | TP |
|---|---|---|---|---|---|---|
| Llama 3.1 (405B) | 1 | 1 | 0 | 0 | 16 | 8 |
| Qwen3-VL (235B) | 1 | 1 | 0 | 1 | 16 | 8 |
| DeepSeek V3.2 (685B) | 1 | 0 | 1 | 1 | 1 | 8 |
| Kimi-K2.5 (1T) | 0 | n/a | n/a | n/a | 1 | 4 |

Kimi-K2.5 requires AITER disabled (VLLM_ROCM_USE_AITER=0) due to MXFP4 hardware requirements and attention head count incompatibility; component flags are therefore not applicable.

Verify AITER is Active

Check the startup logs:

bash
VLLM_ROCM_USE_AITER=1 vllm serve MODEL_NAME 2>&1 | grep -i "aiter\|attention"

Look for:

  • "Using AITER MHA backend"
  • "Using AITER MLA backend"
  • "AITER MOE enabled"

Troubleshooting

DeepSeek Block Size Error

ValueError: Block size must be 1 for MLA models

MLA (Multi-head Latent Attention) architecture requires block size of 1. Always include:

bash
--block-size 1

AITER VSKIP Crashes (DeepSeek)

Symptom: the server crashes during inference with no clear error message.

Fix: Set AITER_ENABLE_VSKIP=0:

bash
export AITER_ENABLE_VSKIP=0

Root cause: AITER_ENABLE_VSKIP defaults to true when unset, which causes crashes on MI300X/MI325X with DeepSeek models.

MLA Accuracy Loss

Accuracy loss has been reported with certain data-parallel/tensor-parallel (DP/TP) configurations:

bash
# Workaround: Disable AITER MLA
export VLLM_ROCM_USE_AITER_MLA=0

MoE Performance Regression

Some AITER versions have MoE regressions:

bash
# Workaround 1: Disable AITER MOE
export VLLM_ROCM_USE_AITER_MOE=0

# Workaround 2: Use known-good image
docker pull rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103

GEMM Kernel Error

RuntimeError: wrong! device_gemm with the specified compilation parameters does not support this GEMM problem

For MoE models like Qwen3-VL, add KV offloading flags:

bash
--kv-offloading-backend native \
--kv-offloading-size 64 \
--disable-hybrid-kv-cache-manager

Long Warmup Time

FP8 BMM kernel pre-compilation takes ~3 minutes on first run:

bash
# If unacceptable, disable DeepGEMM
export VLLM_USE_DEEP_GEMM=0

Quick Reduce (Multi-GPU Optimization)

For large tensor parallel configurations:

bash
# Quantized all-reduce
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION="FP"

# BF16 to FP16 cast for performance
export VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16=1

Beneficial for TP > 4 and high concurrency.
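
These variables can be combined with any of the launch commands above; a sketch adding them to a TP=8 container (image and model follow the Llama 3.1 example):

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP" \
  --env "VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16=1" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8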
