AITER (AMD's AI Tensor Engine for ROCm) provides optimized attention kernels for AMD GPUs. This study measures its impact on inference throughput across model architectures.
AITER replaces standard attention kernels with ROCm-optimized implementations. Its effectiveness varies by architecture:
| Architecture | AITER Support | Notes |
|---|---|---|
| GQA (Llama-405B) | Toggleable | Can enable/disable for A/B testing |
| MLA (DeepSeek V3.2) | Required | vLLM depends on AITER's sparse attention indexer |
| MLA (Kimi-K2.5) | Disabled | Head count incompatibility with TP=4 |
| GQA + Vision (Qwen3-VL) | Default | Uses standard attention path |
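For the toggleable case, the kernel path is switched with vLLM's `VLLM_ROCM_USE_AITER` environment variable. A minimal A/B launch sketch; the model identifier, port, and tensor-parallel size are illustrative assumptions, not the exact benchmark configuration:

```shell
# Baseline: standard attention kernels (AITER off)
VLLM_ROCM_USE_AITER=0 vllm serve meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 --port 8000

# Treatment: AITER's ROCm-optimized kernels, otherwise identical flags
VLLM_ROCM_USE_AITER=1 vllm serve meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 --port 8000
```

Keeping every other flag identical between the two launches is what makes the comparison in the next section a clean A/B test.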
Llama-3.1-405B is the only model where AITER can be cleanly toggled, making it ideal for an A/B comparison. Each configuration was tested across 5 independent runs with 100 requests per concurrency level.
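Each reported number collapses the 5 runs into a single value. A minimal aggregation sketch, assuming a simple mean with standard deviation as the variance measure (the exact aggregation used is not stated in the source):

```python
from statistics import mean, stdev

def aggregate_runs(throughputs_tok_s: list[float]) -> tuple[float, float]:
    """Collapse per-run throughput samples (tok/s) into (mean, stdev)."""
    return mean(throughputs_tok_s), stdev(throughputs_tok_s)

# Example: five hypothetical runs at one concurrency level
avg, sd = aggregate_runs([6910.0, 6987.0, 6955.0, 6921.0, 7002.0])
```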

| Concurrency | AITER Disabled (tok/s) | AITER Enabled (tok/s) | Difference (%) |
|---|---|---|---|
| 1 | 137 | 150 | +10.01% |
| 5 | 554 | 549 | -1.05% |
| 10 | 1,092 | 1,084 | -0.77% |
| 50 | 4,380 | 4,340 | -0.91% |
| 100 | 6,682 | 6,955 | +4.08% |
| 200 | 6,663 | 6,871 | +3.13% |
| 500 | 6,676 | 6,972 | +4.43% |
Findings:
- At concurrency 1, AITER improves single-stream throughput by roughly 10% (137 → 150 tok/s).
- At moderate concurrency (5-50), the difference is under 1.1% in AITER's disfavor, effectively within noise.
- At high concurrency (100+), AITER delivers a consistent 3-4% gain (e.g., 6,676 → 6,972 tok/s at concurrency 500).

Recommendation: Enable AITER for production workloads at high concurrency. The 3-4% throughput gain is meaningful at scale, and the higher run-to-run variance is acceptable for batch processing.
Kimi-K2.5 runs with VLLM_ROCM_USE_AITER=0 because its MLA head count is incompatible with AITER's kernels at TP=4 (see the support table above). Running without AITER, Kimi-K2.5 still achieves stable inference at ~950 tok/s peak throughput.