Hardware Requirements

Updated on 11 March, 2026

Minimum and recommended specifications for running vLLM on AMD Instinct GPUs.


Supported GPUs

| GPU | HBM | Memory Bandwidth | Architecture |
|---|---|---|---|
| MI300X | 192 GB HBM3 | 5.3 TB/s | CDNA 3 (gfx942) |
| MI325X | 256 GB HBM3E | 6.0 TB/s | CDNA 3 (gfx942) |

Both GPUs share the same architecture and use identical vLLM configurations.
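Because both parts report as gfx942, a single launch command covers either GPU. A sketch with illustrative values (the model name and utilization fraction are examples, not verified settings from this cookbook):

```shell
# Illustrative vLLM launch for an 8-GPU gfx942 node (MI300X or MI325X).
# Model and flag values are examples; tune them for your deployment.
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90
```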

MI325X Specifications

| Specification | MI325X | 8-GPU Cluster |
|---|---|---|
| HBM3e Capacity | 256 GB | 2 TB total |
| Memory Bandwidth | 6.0 TB/s | 48 TB/s total |
| FP16 Compute | 1,307 TFLOPS | 10.5 PFLOPS |

What This Enables (Verified in Our Testing)

| Capability | MI325X (8x) | Why It Matters |
|---|---|---|
| 1T model (Kimi-K2.5) | Fits with INT4 QAT | Largest open MoE model |
| 685B model in FP8 | ~83 GB weights per GPU | Most of each GPU's 256 GB HBM left for KV cache |
| 1000 concurrent requests | 100% success rate | Massive batch capacity |
| No KV offloading needed | Fits entirely in HBM | Lower latency, simpler config |
| BF16 for 235B models | Fits without quantization | Maximum accuracy when needed |
> **Note**
> LLM inference throughput is typically bound by memory bandwidth rather than compute. The MI325X's 6.0 TB/s bandwidth translates directly into higher token throughput, especially at large batch sizes where memory access dominates.

Specifications from AMD MI325X product page.
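The bandwidth-bound nature of decode can be made concrete with a back-of-envelope ceiling: each generated token must stream the resident weights through HBM at least once, so peak bandwidth divided by the per-GPU weight footprint bounds single-sequence tokens per second. A sketch using illustrative numbers (DeepSeek V3.2 in FP8 at TP=8; real throughput depends on batching, kernels, and KV traffic):

```shell
# Roofline-style decode ceiling from memory bandwidth alone.
BANDWIDTH_GBS=6000   # MI325X peak HBM bandwidth, GB/s
WEIGHTS_GB=83        # per-GPU FP8 weight footprint (illustrative)
awk -v bw="$BANDWIDTH_GBS" -v w="$WEIGHTS_GB" \
    'BEGIN { printf "decode ceiling: ~%.0f tokens/s per sequence\n", bw / w }'
# prints "decode ceiling: ~72 tokens/s per sequence"
```

Batching raises aggregate throughput well past this single-sequence bound because one pass over the weights serves every sequence in the batch.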

Per-GPU Weight Memory by Model

| Model | Per-GPU Weights | Precision | Minimum GPUs |
|---|---|---|---|
| DeepSeek V3.2 (685B) | ~83 GB | FP8 | 8x MI325X |
| Llama-3.1-405B | ~112 GB | FP8 | 8x MI325X |
| Qwen3-VL-235B | ~58 GB | BF16 (FP8 incompatible) | 8x MI325X |
| Kimi-K2.5 (1T) | ~145 GB | INT4 QAT | 4x MI325X (TP=4) |
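The per-GPU figures above come from real checkpoints, but a first approximation is simply parameters × bytes-per-parameter ÷ tensor-parallel degree. A sketch (the helper name is ours; estimates land near, not exactly on, the table values, since real checkpoints add embedding sharding and mixed-precision layers, and QAT keeps some tensors at higher precision):

```shell
# Back-of-envelope per-GPU weight footprint.
# FP8 = 1 byte/param, BF16 = 2, INT4 ~= 0.5.
estimate_gb() {  # args: params_in_billions bytes_per_param tensor_parallel
  awk -v p="$1" -v b="$2" -v tp="$3" \
      'BEGIN { printf "%.0f GB\n", p * b / tp }'
}
estimate_gb 685  1   8   # DeepSeek V3.2, FP8, TP=8  -> 86 GB
estimate_gb 235  2   8   # Qwen3-VL-235B, BF16, TP=8 -> 59 GB
estimate_gb 1000 0.5 4   # Kimi-K2.5, INT4, TP=4     -> 125 GB
```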

Model Load Times (MI325X Verified)

| Model | Precision | Load Time | Notes |
|---|---|---|---|
| DeepSeek V3.2 | FP8 | ~70s (~1.2 min) | +3 min FP8 warmup on first run |
| Llama-3.1-405B | FP8 | ~81s (~1.4 min) | Dense model, varies 41–195s across runs |
| Qwen3-VL-235B | BF16 | ~73s (~1.2 min) | FP8 not supported |
| Kimi-K2.5 | INT4 QAT | ~152s (~2.5 min) | TP=4, nightly vLLM required |
> **Note: Memory Advantage**
> The MI325X's 256 GB HBM capacity often eliminates the memory-optimization techniques that smaller GPUs require.
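One way to see the advantage is the KV-cache headroom left per GPU once weights are resident. A sketch with illustrative numbers (the 70 KB/token KV footprint is a hypothetical round figure; the real value varies widely with layer count, KV heads, precision, and attention variant):

```shell
# Per-GPU KV-cache headroom after loading weights (illustrative numbers).
HBM_GB=256           # MI325X HBM capacity
WEIGHTS_GB=83        # e.g. a DeepSeek V3.2 FP8 shard at TP=8
KV_KB_PER_TOKEN=70   # hypothetical per-token KV footprint
awk -v h="$HBM_GB" -v w="$WEIGHTS_GB" -v kv="$KV_KB_PER_TOKEN" \
    'BEGIN { free = h - w;
             printf "free HBM: %d GB -> ~%.1fM cacheable tokens per GPU\n",
                    free, free * 1e6 / kv / 1e6 }'
# prints "free HBM: 173 GB -> ~2.5M cacheable tokens per GPU"
```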

System Requirements

| Component | Minimum | Recommended |
|---|---|---|
| ROCm | 6.2.x | 6.4.x or 7.0.x |
| Docker | 24.0+ | 29.x |
| System RAM | 64 GB | 256 GB+ |
| CPU Cores | 16 | 64+ |
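The host-side minimums can be checked with standard tools; a quick sketch:

```shell
# Compare host resources against the stated minimums (64 GB RAM, 16 cores).
ram_gb=$(free -g | awk '/^Mem:/ { print $2 }')
cores=$(nproc)
[ "$ram_gb" -ge 64 ] || echo "WARN: ${ram_gb} GB RAM is below the 64 GB minimum"
[ "$cores" -ge 16 ]  || echo "WARN: ${cores} cores is below the 16-core minimum"
echo "RAM: ${ram_gb} GB, CPU cores: ${cores}"
```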

Verified Software Stack

This cookbook was tested with the following versions:

| Component | Version | Notes |
|---|---|---|
| ROCm | 6.4.2-120 | Verified working |
| vLLM | 0.14.1 | Latest stable; Kimi-K2.5 requires a nightly build |
| Docker | 29.1.5 | With ROCm support |
| RCCL | 2.26.6 | Multi-GPU communication |
| Python | 3.10+ | Inside container |

NUMA Configuration

For multi-GPU systems, NUMA balancing must be disabled:

```bash
# Check current setting
cat /proc/sys/kernel/numa_balancing

# Should return 0 (disabled)
```

If enabled, disable it:

```bash
sudo sysctl kernel.numa_balancing=0

# Persist across reboots
echo "kernel.numa_balancing=0" | sudo tee /etc/sysctl.d/99-numa.conf
```

Verification Commands

```bash
# Check GPU visibility
rocm-smi --showproductname

# Check VRAM
rocm-smi --showmeminfo vram

# Check ROCm version
cat /opt/rocm/.info/version

# Check GPU topology
rocm-smi --showtopo
```

Example Output (8x MI325X)

```
GPU[0] : gfx942:sramecc-:xnack- : AMD Instinct MI325X
GPU[1] : gfx942:sramecc-:xnack- : AMD Instinct MI325X
...
GPU[7] : gfx942:sramecc-:xnack- : AMD Instinct MI325X
```

VRAM Total: 256 GB per GPU (2 TB total)
