Inference Cookbook for ROCm

Updated on 17 March 2026

Production-ready deployment guide for running large language models on AMD Instinct MI325X GPUs using vLLM.


What You'll Learn

This cookbook provides tested, working configurations for deploying LLMs on AMD hardware:

  • Environment Setup - ROCm, Docker, and system configuration
  • Model Deployment - Step-by-step guides for DeepSeek V3.2, Qwen3, and more
  • Performance Optimization - FP8 quantization, KV cache tuning, AITER configuration
  • Benchmarks - Real throughput numbers from MI325X testing

Tested Hardware

All configurations in this cookbook have been verified on:

| Specification | Value |
|---|---|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2 |
| vLLM | 0.14.1 |

MI325X Capabilities

| Specification | Scope | Impact |
|---|---|---|
| 256 GB HBM3E | Per GPU | Run 1T+ models; 1000+ concurrent requests |
| 6.0 TB/s bandwidth | Per GPU | High throughput (LLMs are memory-bound) |
| 2 TB (8x GPUs) | Total | No KV offloading; full BF16 for 235B models |
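
As a rough illustration of what the table implies, here is a back-of-envelope sketch of weight memory versus cluster capacity. It assumes weight size ≈ parameter count × bytes per parameter and ignores KV cache, activations, and runtime overhead, so treat the percentages as lower bounds on real usage:

```python
# Rough weight-memory estimate; real deployments also need headroom
# for KV cache, activations, and framework overhead.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}

CLUSTER_GB = 8 * 256  # 8x MI325X, 256 GB HBM3E each

def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]

for name, params_b, prec in [
    ("DeepSeek V3.2", 685, "FP8"),
    ("Llama-3.1-405B", 405, "FP8"),
    ("Qwen3-VL-235B", 235, "BF16"),
]:
    gb = weight_gb(params_b, prec)
    print(f"{name} ({prec}): ~{gb:.0f} GB weights, "
          f"{100 * gb / CLUSTER_GB:.0f}% of cluster HBM")
```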

Models Covered

| Model | Parameters | Precision | Status |
|---|---|---|---|
| Kimi-K2.5 | 1T (32B active) | INT4 QAT | Verified |
| DeepSeek V3.2 | 685B | FP8 | Verified |
| Llama-3.1-405B | 405B | FP8 | Verified |
| Qwen3-VL-235B | 235B (22B active) | BF16 | Verified |
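
For the MoE entries above, compute cost tracks active parameters, not totals. A common rule of thumb (an assumption here, not a figure from this cookbook: ~2 FLOPs per active parameter per generated token, ignoring attention's sequence-length term) sketches why a 22B-active model can out-run much denser ones:

```python
def flops_per_token(active_params_billion: float) -> float:
    # Rule of thumb: ~2 FLOPs per active parameter per generated token
    # (matrix-multiply dominated; ignores attention's seq-length term).
    return 2.0 * active_params_billion * 1e9

for name, active_b in [
    ("Kimi-K2.5 (1T total, 32B active)", 32),
    ("Qwen3-VL-235B (235B total, 22B active)", 22),
    ("Llama-3.1-405B (dense)", 405),
]:
    print(f"{name}: ~{flops_per_token(active_b) / 1e9:.0f} GFLOPs/token")
```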

Quick Start

Get a model running in under 5 minutes:

bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-0.6B

Test the endpoint:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}]}'
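
The same request can be issued from Python with only the standard library; this sketch assumes the server started above is still listening on localhost:8000 (it prints an error instead of failing if the model is still loading):

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
except OSError as exc:
    # Server not up yet (or still loading the model).
    print(f"request failed: {exc}")
```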

Key Findings

From our comprehensive testing on MI325X:

  • FP8 quantization enables running large models (405B+) that wouldn't fit in BF16
  • Throughput scaling up to 50–100 concurrent requests per model, with stable performance through 1,000
  • Architecture matters - GQA models (Qwen3) significantly outperform MLA models (DeepSeek) in throughput
  • KV cache offloading works with GQA models but NOT with MLA (DeepSeek, Kimi)
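
The KV-offloading point comes down to cache layout and size. For a GQA model the per-token cache is easy to estimate; the dimensions below are illustrative placeholders, not a published Qwen3 config:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                dtype_bytes: int, seq_len: int, batch: int) -> float:
    # 2x for keys and values; one (kv_heads x head_dim) entry
    # per layer per cached token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token_bytes * seq_len * batch / 1e9

# Hypothetical GQA config: 94 layers, 8 KV heads, head_dim 128,
# FP8 cache (1 byte/element), 100 concurrent 8K-token requests.
gb = kv_cache_gb(layers=94, kv_heads=8, head_dim=128,
                 dtype_bytes=1, seq_len=8192, batch=100)
print(f"~{gb:.0f} GB of KV cache")
```

MLA models store a compressed per-layer latent rather than per-head K/V tensors, so this formula does not carry over to DeepSeek or Kimi, which is consistent with the offloading limitation noted above.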

Peak Performance Achieved

| Model | Peak Throughput | p99 Latency | Best For |
|---|---|---|---|
| Qwen3-VL-235B | 11,218 tok/s | 15.43s | Vision, high-volume batch |
| Llama-3.1-405B | 6,808 tok/s | 25.83s | Consistent latency, long context |
| DeepSeek V3.2 | 5,786 tok/s | 23.01s | Reasoning, tool calling |
| Kimi-K2.5 | 952 tok/s | 182.52s | Vision, tool calling (TP=4) |

Multi-run means (n=5). Peak throughput measured at optimal concurrency per model.

Next Steps

  1. Hardware Requirements - Check your system meets requirements
  2. Environment Setup - Configure ROCm and Docker
  3. First Deployment - Deploy your first model
  4. Benchmarks - Detailed benchmarking of DeepSeek, Llama, Qwen3-VL, and Kimi models on AMD Instinct MI325X GPUs with stress and validation testing.
  5. Troubleshooting - Common issues and verified solutions for vLLM on AMD Instinct GPUs.
  6. Production Deployment - Deploy vLLM with health checks, monitoring, and resilience on AMD Instinct GPUs.