Production-ready deployment guide for running large language models on AMD Instinct MI325X GPUs using vLLM.
This cookbook provides tested, working configurations for deploying LLMs on AMD hardware.
All configurations in this cookbook have been verified on:
| Specification | Value |
|---|---|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3e per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2 |
| vLLM | 0.14.1 |
| Specification | Value | Impact |
|---|---|---|
| 256 GB HBM3e | Per GPU | Run 1T+ models; 1000+ concurrent requests |
| 6.0 TB/s bandwidth | Per GPU | High throughput (LLMs are memory-bound) |
| 2 TB cluster (8x) | Total | No KV offloading; full BF16 for 235B models |
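To see why 2 TB of HBM removes the need for KV offloading, it helps to estimate KV cache size per token. The sketch below uses the standard formula (2 tensors, K and V, per layer) with architecture parameters assumed to resemble Llama-3.1-405B; check the model's `config.json` for real values.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # One K and one V tensor per layer, each num_kv_heads * head_dim wide
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3.1-405B-like config: 126 layers, 8 KV heads (GQA),
# head_dim 128, FP8 KV cache (1 byte per element)
per_token = kv_cache_bytes_per_token(126, 8, 128, 1)

# One request at the benchmark shape used in this cookbook (2048 in + 512 out)
per_request_gib = per_token * (2048 + 512) / 2**30
print(per_token, round(per_request_gib, 3))
```

At roughly 252 KiB of KV cache per token under these assumptions, even thousands of concurrent 2.5K-token requests fit comfortably in 2 TB of HBM.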
| Model | Parameters | Precision | Status |
|---|---|---|---|
| Kimi-K2.5 | 1T (32B active) | INT4 QAT | Verified |
| DeepSeek V3.2 | 685B | FP8 | Verified |
| Llama-3.1-405B | 405B | FP8 | Verified |
| Qwen3-VL-235B | 235B (22B active) | BF16 | Verified |
Get a model running in under 5 minutes:
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-0.6B
```
Test the endpoint:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}]}'
```
From our comprehensive testing on MI325X:
| Model | Peak Throughput | p99 Latency | Best For |
|---|---|---|---|
| Qwen3-VL-235B | 11,218 tok/s | 15.43s | Vision, high-volume batch |
| Llama-3.1-405B | 6,808 tok/s | 25.83s | Consistent latency, long context |
| DeepSeek V3.2 | 5,786 tok/s | 23.01s | Reasoning, tool calling |
| Kimi-K2.5 | 952 tok/s | 182.52s | Vision, tool calling (TP=4) |
Multi-run means (n=5). Peak throughput measured at optimal concurrency per model.
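As a quick sanity check on these numbers, per-GPU throughput can be derived from the table (Kimi-K2.5 ran at TP=4; the others used all 8 GPUs):

```python
# Peak throughput (tok/s) and GPU count per model, from the table above
peak = {
    "Qwen3-VL-235B": (11218, 8),
    "Llama-3.1-405B": (6808, 8),
    "DeepSeek V3.2": (5786, 8),
    "Kimi-K2.5": (952, 4),  # TP=4
}
per_gpu = {model: round(toks / gpus) for model, (toks, gpus) in peak.items()}
print(per_gpu)
```

This puts the dense and MoE models between roughly 240 and 1,400 tok/s per GPU at peak, a useful baseline when sizing a deployment.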
Deploy DeepSeek V3.2 (685B parameters) on AMD Instinct GPUs.
Deploy Meta's Llama-3.1-405B-Instruct on AMD Instinct GPUs.
Deploy Qwen3-VL-235B-A22B-Instruct (Vision-Language model) on AMD Instinct GPUs.
Deploy Kimi-K2.5 (1 trillion parameters) on AMD Instinct GPUs.
Reduce memory usage and improve throughput with FP8 quantization on AMD Instinct GPUs.
Extend effective memory by offloading KV cache to CPU memory.
Maximize throughput by tuning vLLM for high concurrent request loads.
Configure AMD's AI Tensor Engine for ROCm (AITER) to accelerate vLLM inference.
This guide explains the methodology used for all benchmark results in this documentation, and provides the scripts to reproduce them.
All results below are aggregated from 5 independent benchmark runs per model on 8x AMD Instinct MI325X GPUs. Each run used 100 requests per concurrency level with 2,048 input tokens and 512 output tokens.
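The aggregation described above (means over independent runs, p99 over per-request latencies) can be sketched with the standard library; the helper names here are illustrative, not the cookbook's actual scripts:

```python
import math
import statistics

def aggregate_runs(throughputs):
    """Mean and sample stdev across independent runs (n=5 in this cookbook)."""
    return statistics.mean(throughputs), statistics.stdev(throughputs)

def p99(latencies_s):
    """Nearest-rank p99 over per-request latencies from a single run."""
    ordered = sorted(latencies_s)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]
```

With 100 requests per concurrency level, nearest-rank p99 is simply the 99th-slowest request of the run.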
AITER (AMD's AI Tensor Engine for ROCm) provides optimized attention kernels for AMD GPUs. This study measures its impact on inference throughput across model architectures.
Detailed GPU memory measurements for all 4 models running on AMD Instinct MI325X GPUs (256 GB HBM3e per GPU). Measurements taken via `rocm-smi` after model loading and warmup completion.
Fine-grained concurrency sweep from 500 to 1,000 concurrent requests (step 50) to identify the exact saturation knee for each model. Each concurrency level was tested across 3 independent runs with 200 requests per level.
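One simple way to locate the saturation knee in such a sweep is to find the first concurrency level where the marginal throughput gain over the previous level drops below a threshold. The data and threshold below are illustrative, not measured results:

```python
def find_knee(points, gain_threshold=0.02):
    """points: (concurrency, throughput) pairs sorted by concurrency.
    Returns the first concurrency after which the relative throughput
    gain falls below gain_threshold (i.e., throughput has flattened)."""
    for (c0, t0), (c1, t1) in zip(points, points[1:]):
        if (t1 - t0) / t0 < gain_threshold:
            return c0
    return points[-1][0]  # never flattened within the sweep

# Illustrative sweep data (concurrency, tok/s):
sweep = [(500, 9000), (550, 9600), (600, 10100),
         (650, 10300), (700, 10400), (750, 10410)]
print(find_knee(sweep))
```

Past the knee, added concurrency mostly increases queueing latency rather than throughput, which is why the sweep uses a fine 50-request step.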
Real-time GPU monitoring data collected via `rocm-smi` during Kimi-K2.5 benchmark runs on 8x AMD Instinct MI325X GPUs. Data was sampled at 1-second intervals across 3 independent runs.
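Samples like these can be post-processed from `rocm-smi`'s CSV output. The column names in this sketch are an assumption (they vary across ROCm versions, so check your `rocm-smi --csv` output); the sample rows are illustrative, not measured data:

```python
import csv
import io

# Illustrative sample in a rocm-smi-style CSV shape (column names assumed)
SAMPLE = """device,GPU use (%)
card0,97
card1,95
"""

def mean_gpu_use(csv_text, column="GPU use (%)"):
    """Average the utilization column across all GPUs in one sample."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)

print(mean_gpu_use(SAMPLE))
```

Averaging each 1-second sample this way, then plotting over the run, makes utilization dips (e.g. during prefill/decode phase shifts) easy to spot.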
Complete documentation of the benchmark methodology, test environment, and tooling validation used for all results in this cookbook.