Hardware Requirements

Updated on 11 March, 2026

Minimum and recommended specifications for running vLLM on AMD Instinct GPUs.


Supported GPUs

| GPU | HBM | Memory Bandwidth | Architecture |
|---|---|---|---|
| MI300X | 192 GB HBM3 | 5.3 TB/s | CDNA 3 (gfx942) |
| MI325X | 256 GB HBM3E | 6.0 TB/s | CDNA 3 (gfx942) |

Both GPUs share the same architecture and use identical vLLM configurations.
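Because both parts report as gfx942, a single launch command covers either GPU. A sketch with illustrative values (the model name and utilization fraction are examples, not verified settings from this cookbook):

```shell
# Illustrative vLLM launch for an 8-GPU gfx942 node (MI300X or MI325X).
# Model and flag values are examples; tune them for your deployment.
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90
```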

MI325X Specifications

| Specification | MI325X | 8-GPU Cluster |
|---|---|---|
| HBM3e Capacity | 256 GB | 2 TB total |
| Memory Bandwidth | 6.0 TB/s | 48 TB/s total |
| FP16 Compute | 1,307 TFLOPS | 10.5 PFLOPS |

What This Enables (Verified in Our Testing)

| Capability | MI325X (8x) | Why It Matters |
|---|---|---|
| 1T model (Kimi-K2.5) | Fits with INT4 QAT | Largest open MoE model |
| 685B model in FP8 | ~83 GB weights per GPU | Most of each GPU's 256 GB HBM left for KV cache |
| 1000 concurrent requests | 100% success rate | Massive batch capacity |
| No KV offloading needed | Fits entirely in HBM | Lower latency, simpler config |
| BF16 for 235B models | Fits without quantization | Maximum accuracy when needed |
> **Note**
> LLM inference throughput is typically bound by memory bandwidth rather than compute. The MI325X's 6.0 TB/s bandwidth translates directly into higher token throughput, especially at large batch sizes where memory access dominates.

Specifications from AMD MI325X product page.
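The bandwidth-bound nature of decode can be made concrete with a back-of-envelope ceiling: each generated token must stream the resident weights through HBM at least once, so peak bandwidth divided by the per-GPU weight footprint bounds single-sequence tokens per second. A sketch using illustrative numbers (DeepSeek V3.2 in FP8 at TP=8; real throughput depends on batching, kernels, and KV traffic):

```shell
# Roofline-style decode ceiling from memory bandwidth alone.
BANDWIDTH_GBS=6000   # MI325X peak HBM bandwidth, GB/s
WEIGHTS_GB=83        # per-GPU FP8 weight footprint (illustrative)
awk -v bw="$BANDWIDTH_GBS" -v w="$WEIGHTS_GB" \
    'BEGIN { printf "decode ceiling: ~%.0f tokens/s per sequence\n", bw / w }'
# prints "decode ceiling: ~72 tokens/s per sequence"
```

Batching raises aggregate throughput well past this single-sequence bound because one pass over the weights serves every sequence in the batch.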

Per-GPU Weight Memory by Model

| Model | Per-GPU Weights | Precision | Minimum GPUs |
|---|---|---|---|
| DeepSeek V3.2 (685B) | ~83 GB | FP8 | 8x MI325X |
| Llama-3.1-405B | ~112 GB | FP8 | 8x MI325X |
| Qwen3-VL-235B | ~58 GB | BF16 (FP8 incompatible) | 8x MI325X |
| Kimi-K2.5 (1T) | ~145 GB | INT4 QAT | 4x MI325X (TP=4) |
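The per-GPU figures above come from real checkpoints, but a first approximation is simply parameters × bytes-per-parameter ÷ tensor-parallel degree. A sketch (the helper name is ours; estimates land near, not exactly on, the table values, since real checkpoints add embedding sharding and mixed-precision layers, and QAT keeps some tensors at higher precision):

```shell
# Back-of-envelope per-GPU weight footprint.
# FP8 = 1 byte/param, BF16 = 2, INT4 ~= 0.5.
estimate_gb() {  # args: params_in_billions bytes_per_param tensor_parallel
  awk -v p="$1" -v b="$2" -v tp="$3" \
      'BEGIN { printf "%.0f GB\n", p * b / tp }'
}
estimate_gb 685  1   8   # DeepSeek V3.2, FP8, TP=8  -> 86 GB
estimate_gb 235  2   8   # Qwen3-VL-235B, BF16, TP=8 -> 59 GB
estimate_gb 1000 0.5 4   # Kimi-K2.5, INT4, TP=4     -> 125 GB
```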

Model Load Times (MI325X Verified)

| Model | Precision | Load Time | Notes |
|---|---|---|---|
| DeepSeek V3.2 | FP8 | ~70s (~1.2 min) | +3 min FP8 warmup on first run |
| Llama-3.1-405B | FP8 | ~81s (~1.4 min) | Dense model, varies 41–195s across runs |
| Qwen3-VL-235B | BF16 | ~73s (~1.2 min) | FP8 not supported |
| Kimi-K2.5 | INT4 QAT | ~152s (~2.5 min) | TP=4, nightly vLLM required |
> **Note: Memory Advantage**
> The MI325X's 256 GB HBM capacity often eliminates the memory-optimization techniques that smaller GPUs require.
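One way to see the advantage is the KV-cache headroom left per GPU once weights are resident. A sketch with illustrative numbers (the 70 KB/token KV footprint is a hypothetical round figure; the real value varies widely with layer count, KV heads, precision, and attention variant):

```shell
# Per-GPU KV-cache headroom after loading weights (illustrative numbers).
HBM_GB=256           # MI325X HBM capacity
WEIGHTS_GB=83        # e.g. a DeepSeek V3.2 FP8 shard at TP=8
KV_KB_PER_TOKEN=70   # hypothetical per-token KV footprint
awk -v h="$HBM_GB" -v w="$WEIGHTS_GB" -v kv="$KV_KB_PER_TOKEN" \
    'BEGIN { free = h - w;
             printf "free HBM: %d GB -> ~%.1fM cacheable tokens per GPU\n",
                    free, free * 1e6 / kv / 1e6 }'
# prints "free HBM: 173 GB -> ~2.5M cacheable tokens per GPU"
```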

System Requirements

| Component | Minimum | Recommended |
|---|---|---|
| ROCm | 6.2.x | 6.4.x or 7.0.x |
| Docker | 24.0+ | 29.x |
| System RAM | 64 GB | 256 GB+ |
| CPU Cores | 16 | 64+ |
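The host-side minimums can be checked with standard tools; a quick sketch:

```shell
# Compare host resources against the stated minimums (64 GB RAM, 16 cores).
ram_gb=$(free -g | awk '/^Mem:/ { print $2 }')
cores=$(nproc)
[ "$ram_gb" -ge 64 ] || echo "WARN: ${ram_gb} GB RAM is below the 64 GB minimum"
[ "$cores" -ge 16 ]  || echo "WARN: ${cores} cores is below the 16-core minimum"
echo "RAM: ${ram_gb} GB, CPU cores: ${cores}"
```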

Verified Software Stack

This cookbook was tested with the following versions:

| Component | Version | Notes |
|---|---|---|
| ROCm | 6.4.2-120 | Verified working |
| vLLM | 0.14.1 | Latest stable; Kimi-K2.5 requires a nightly build |
| Docker | 29.1.5 | With ROCm support |
| RCCL | 2.26.6 | Multi-GPU communication |
| Python | 3.10+ | Inside container |

NUMA Configuration

For multi-GPU systems, NUMA balancing must be disabled:

```bash
# Check current setting
cat /proc/sys/kernel/numa_balancing

# Should return 0 (disabled)
```

If enabled, disable it:

```bash
sudo sysctl kernel.numa_balancing=0

# Persist across reboots
echo "kernel.numa_balancing=0" | sudo tee /etc/sysctl.d/99-numa.conf
```

Verification Commands

```bash
# Check GPU visibility
rocm-smi --showproductname

# Check VRAM
rocm-smi --showmeminfo vram

# Check ROCm version
cat /opt/rocm/.info/version

# Check GPU topology
rocm-smi --showtopo
```

Example Output (8x MI325X)

```
GPU[0] : gfx942:sramecc-:xnack- : AMD Instinct MI325X
GPU[1] : gfx942:sramecc-:xnack- : AMD Instinct MI325X
...
GPU[7] : gfx942:sramecc-:xnack- : AMD Instinct MI325X
```

VRAM Total: 256 GB per GPU (2 TB total)
