Production-ready deployment guide for running large language models on AMD Instinct MI325X GPUs using vLLM.
This cookbook provides tested, working configurations for deploying LLMs on AMD hardware.
All configurations in this cookbook have been verified on:
| Specification | Value |
|---|---|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3e per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2 |
| vLLM | 0.14.1 |
| Specification | Value | Impact |
|---|---|---|
| 256 GB HBM3e | Per GPU | Run 1T+ models; 1000+ concurrent requests |
| 6.0 TB/s bandwidth | Per GPU | High throughput (LLMs are memory-bound) |
| 2 TB cluster (8x) | Total | No KV offloading; full BF16 for 235B models |
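To see why 2 TB of HBM removes the need for KV offloading, it helps to estimate KV cache size per token. The sketch below uses the standard formula (2 tensors, K and V, per layer) with architecture parameters assumed to resemble Llama-3.1-405B; check the model's `config.json` for real values.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # One K and one V tensor per layer, each num_kv_heads * head_dim wide
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3.1-405B-like config: 126 layers, 8 KV heads (GQA),
# head_dim 128, FP8 KV cache (1 byte per element)
per_token = kv_cache_bytes_per_token(126, 8, 128, 1)

# One request at the benchmark shape used in this cookbook (2048 in + 512 out)
per_request_gib = per_token * (2048 + 512) / 2**30
print(per_token, round(per_request_gib, 3))
```

At roughly 252 KiB of KV cache per token under these assumptions, even thousands of concurrent 2.5K-token requests fit comfortably in 2 TB of HBM.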
| Model | Parameters | Precision | Status |
|---|---|---|---|
| Kimi-K2.5 | 1T (32B active) | INT4 QAT | Verified |
| DeepSeek V3.2 | 685B | FP8 | Verified |
| Llama-3.1-405B | 405B | FP8 | Verified |
| Qwen3-VL-235B | 235B (22B active) | BF16 | Verified |
Get a model running in under 5 minutes:
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-0.6B
```
Test the endpoint:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}]}'
```
From our comprehensive testing on MI325X:
| Model | Peak Throughput | p99 Latency | Best For |
|---|---|---|---|
| Qwen3-VL-235B | 11,218 tok/s | 15.43s | Vision, high-volume batch |
| Llama-3.1-405B | 6,808 tok/s | 25.83s | Consistent latency, long context |
| DeepSeek V3.2 | 5,786 tok/s | 23.01s | Reasoning, tool calling |
| Kimi-K2.5 | 952 tok/s | 182.52s | Vision, tool calling (TP=4) |
Multi-run means (n=5). Peak throughput measured at optimal concurrency per model.
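As a quick sanity check on these numbers, per-GPU throughput can be derived from the table (Kimi-K2.5 ran at TP=4; the others used all 8 GPUs):

```python
# Peak throughput (tok/s) and GPU count per model, from the table above
peak = {
    "Qwen3-VL-235B": (11218, 8),
    "Llama-3.1-405B": (6808, 8),
    "DeepSeek V3.2": (5786, 8),
    "Kimi-K2.5": (952, 4),  # TP=4
}
per_gpu = {model: round(toks / gpus) for model, (toks, gpus) in peak.items()}
print(per_gpu)
```

This puts the dense and MoE models between roughly 240 and 1,400 tok/s per GPU at peak, a useful baseline when sizing a deployment.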
Deploy DeepSeek V3.2 (685B parameters) on AMD Instinct GPUs.
Deploy Meta's Llama-3.1-405B-Instruct on AMD Instinct GPUs.
Deploy Qwen3-VL-235B-A22B-Instruct (Vision-Language model) on AMD Instinct GPUs.
Deploy Kimi-K2.5 (1 trillion parameters) on AMD Instinct GPUs.
Reduce memory usage and improve throughput with FP8 quantization on AMD Instinct GPUs.
Extend effective memory by offloading KV cache to CPU memory.
Maximize throughput by tuning vLLM for high concurrent request loads.
Configure AMD's AI Tensor Engine for ROCm (AITER) to accelerate vLLM inference.
This guide explains the methodology used for all benchmark results in this documentation, and provides the scripts to reproduce them.
All results below are aggregated from 5 independent benchmark runs per model on 8x AMD Instinct MI325X GPUs. Each run used 100 requests per concurrency level with 2,048 input tokens and 512 output tokens.
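The aggregation described above (means over independent runs, p99 over per-request latencies) can be sketched with the standard library; the helper names here are illustrative, not the cookbook's actual scripts:

```python
import math
import statistics

def aggregate_runs(throughputs):
    """Mean and sample stdev across independent runs (n=5 in this cookbook)."""
    return statistics.mean(throughputs), statistics.stdev(throughputs)

def p99(latencies_s):
    """Nearest-rank p99 over per-request latencies from a single run."""
    ordered = sorted(latencies_s)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]
```

With 100 requests per concurrency level, nearest-rank p99 is simply the 99th-slowest request of the run.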
AITER (AMD's AI Tensor Engine for ROCm) provides optimized attention kernels for AMD GPUs. This study measures its impact on inference throughput across model architectures.
Detailed GPU memory measurements for all 4 models running on AMD Instinct MI325X GPUs (256 GB HBM3e per GPU). Measurements taken via `rocm-smi` after model loading and warmup completion.
Fine-grained concurrency sweep from 500 to 1,000 concurrent requests (step 50) to identify the exact saturation knee for each model. Each concurrency level was tested across 3 independent runs with 200 requests per level.
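One simple way to locate the saturation knee in such a sweep is to find the first concurrency level where the marginal throughput gain over the previous level drops below a threshold. The data and threshold below are illustrative, not measured results:

```python
def find_knee(points, gain_threshold=0.02):
    """points: (concurrency, throughput) pairs sorted by concurrency.
    Returns the first concurrency after which the relative throughput
    gain falls below gain_threshold (i.e., throughput has flattened)."""
    for (c0, t0), (c1, t1) in zip(points, points[1:]):
        if (t1 - t0) / t0 < gain_threshold:
            return c0
    return points[-1][0]  # never flattened within the sweep

# Illustrative sweep data (concurrency, tok/s):
sweep = [(500, 9000), (550, 9600), (600, 10100),
         (650, 10300), (700, 10400), (750, 10410)]
print(find_knee(sweep))
```

Past the knee, added concurrency mostly increases queueing latency rather than throughput, which is why the sweep uses a fine 50-request step.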
Real-time GPU monitoring data collected via `rocm-smi` during Kimi-K2.5 benchmark runs on 8x AMD Instinct MI325X GPUs. Data was sampled at 1-second intervals across 3 independent runs.
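Samples like these can be post-processed from `rocm-smi`'s CSV output. The column names in this sketch are an assumption (they vary across ROCm versions, so check your `rocm-smi --csv` output); the sample rows are illustrative, not measured data:

```python
import csv
import io

# Illustrative sample in a rocm-smi-style CSV shape (column names assumed)
SAMPLE = """device,GPU use (%)
card0,97
card1,95
"""

def mean_gpu_use(csv_text, column="GPU use (%)"):
    """Average the utilization column across all GPUs in one sample."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)

print(mean_gpu_use(SAMPLE))
```

Averaging each 1-second sample this way, then plotting over the run, makes utilization dips (e.g. during prefill/decode phase shifts) easy to spot.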
Complete documentation of the benchmark methodology, test environment, and tooling validation used for all results in this cookbook.