Benchmarks

Updated on 12 March, 2026

Detailed benchmarking of DeepSeek, Llama, Qwen3-VL, and Kimi models on AMD Instinct MI325X GPUs with stress and validation testing.

This guide explains the methodology behind every benchmark result in this documentation and provides the scripts needed to reproduce them.

All results below are aggregated from 5 independent benchmark runs per model on 8x AMD Instinct MI325X GPUs. Each run issued 100 requests per concurrency level, with 2,048 input tokens and 512 output tokens per request.

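Aggregating across runs can be as simple as reporting the mean and standard deviation of each run's throughput. The sketch below illustrates this; the throughput numbers are placeholders, not measured results.

```python
import statistics

def aggregate_runs(throughputs):
    """Collapse per-run throughput samples (tokens/s) into mean and stdev."""
    return {
        "mean": statistics.mean(throughputs),
        "stdev": statistics.stdev(throughputs) if len(throughputs) > 1 else 0.0,
    }

# Five hypothetical runs at one concurrency level (illustrative numbers only).
runs = [10412.0, 10388.5, 10450.2, 10401.8, 10397.5]
agg = aggregate_runs(runs)
print(f"throughput: {agg['mean']:.1f} +/- {agg['stdev']:.1f} tok/s")
```

Reporting the spread alongside the mean makes it obvious when a single noisy run is skewing a result.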
AITER (AMD's AI Tensor Engine for ROCm) provides optimized attention kernels for AMD GPUs. This study measures its impact on inference throughput across model architectures.

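An on/off comparison like this reduces to paired throughput measurements per model (in vLLM, AITER is typically toggled via the `VLLM_ROCM_USE_AITER` environment variable, though this may vary by version). A minimal sketch of the comparison arithmetic, with hypothetical model names and numbers:

```python
def speedup(baseline_tps, aiter_tps):
    """Fractional throughput gain from enabling AITER kernels."""
    return aiter_tps / baseline_tps - 1.0

# Hypothetical paired measurements in tokens/s: (AITER off, AITER on).
results = {
    "model-a": (9800.0, 11270.0),
    "model-b": (7450.0, 7900.0),
}
for name, (base, aiter) in results.items():
    print(f"{name}: {speedup(base, aiter):+.1%}")
```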
Detailed GPU memory measurements for all four models (DeepSeek, Llama, Qwen3-VL, and Kimi) running on AMD Instinct MI325X GPUs (256 GB HBM3e per GPU). Measurements were taken via `rocm-smi` after model loading and warmup completed.

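In practice this means parsing the text output of `rocm-smi --showmeminfo vram`. The line format below is an assumption (it varies across ROCm versions), so treat this parser as a sketch to adapt, not a drop-in tool.

```python
import re

# Assumed line format from `rocm-smi --showmeminfo vram` (varies by ROCm version):
#   GPU[0]  : VRAM Total Used Memory (B): 123456789
USED_RE = re.compile(r"GPU\[(\d+)\].*VRAM Total Used Memory \(B\):\s*(\d+)")

def parse_vram_used(text):
    """Return {gpu_index: used_bytes} parsed from rocm-smi output."""
    return {int(m.group(1)): int(m.group(2)) for m in USED_RE.finditer(text)}

sample = """\
GPU[0]  : VRAM Total Memory (B): 274877906944
GPU[0]  : VRAM Total Used Memory (B): 198642237440
GPU[1]  : VRAM Total Memory (B): 274877906944
GPU[1]  : VRAM Total Used Memory (B): 197568495616
"""
for gpu, used in parse_vram_used(sample).items():
    print(f"GPU {gpu}: {used / 2**30:.1f} GiB used")
```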
Fine-grained concurrency sweep from 500 to 1,000 concurrent requests, in steps of 50, to identify the exact saturation knee for each model. Each concurrency level was tested across 3 independent runs with 200 requests per level.

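One simple way to locate the knee in such a sweep is a marginal-gain heuristic: the knee is the concurrency level after which each additional step of 50 buys less than some small relative throughput gain. This is an illustrative sketch with made-up sweep data, not the cookbook's actual detection method.

```python
def find_knee(points, min_gain=0.01):
    """Return the concurrency level after which the relative throughput
    gain per step first drops below `min_gain` (marginal-gain heuristic)."""
    points = sorted(points)  # (concurrency, throughput) pairs
    for (c0, t0), (c1, t1) in zip(points, points[1:]):
        if (t1 - t0) / t0 < min_gain:
            return c0
    return points[-1][0]

# Hypothetical sweep: throughput flattens past 750 concurrent requests.
sweep = [(500, 9000), (550, 9600), (600, 10100), (650, 10500),
         (700, 10800), (750, 11000), (800, 11050), (850, 11060)]
print(find_knee(sweep))  # → 750
```

Averaging the 3 runs per level before applying the heuristic keeps a single noisy level from producing a false knee.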
Real-time GPU monitoring data collected via `rocm-smi` during Kimi-K2.5 benchmark runs on 8x AMD Instinct MI325X GPUs. Data was sampled at 1-second intervals across 3 independent runs.

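A 1-second sampling loop of this kind can be sketched as a generic poller that timestamps the result of a probe callable. The probe here is a stand-in; a real one would shell out to `rocm-smi` and parse its output.

```python
import time

def sample(probe, interval_s=1.0, n_samples=5):
    """Poll `probe` every `interval_s` seconds, timestamping each reading."""
    readings = []
    for _ in range(n_samples):
        readings.append((time.monotonic(), probe()))
        time.sleep(interval_s)
    return readings

def fake_probe():
    # Stand-in for a rocm-smi query; returns a fixed utilization reading.
    return {"gpu_util_pct": 97}

data = sample(fake_probe, interval_s=0.01, n_samples=3)
print(len(data))  # → 3
```

A fixed-interval `sleep` drifts slightly under load; for long runs, scheduling each sample against a start timestamp keeps the 1-second cadence exact.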
Complete documentation of the benchmark methodology, test environment, and tooling validation used for all results in this cookbook.