Minimum and recommended specifications for running vLLM on AMD Instinct GPUs.
## Supported GPUs

| GPU    | HBM          | Memory Bandwidth | Architecture    |
|--------|--------------|------------------|-----------------|
| MI300X | 192 GB HBM3  | 5.3 TB/s         | CDNA 3 (gfx942) |
| MI325X | 256 GB HBM3E | 6.0 TB/s         | CDNA 3 (gfx942) |
Both GPUs share the same architecture and use identical vLLM configurations.
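As a concrete illustration, a single-node, 8-GPU vLLM deployment on either card could be launched as below. The model name and flag values are placeholders, not a tested configuration:

```shell
# Illustrative single-node launch; model and flag values are placeholders.
# --tensor-parallel-size 8 shards the model across all 8 GPUs in the node.
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

Because MI300X and MI325X share the gfx942 architecture, the same invocation works on both; only the available memory headroom differs.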
## MI325X Specifications

| Specification    | MI325X       | 8-GPU Cluster |
|------------------|--------------|---------------|
| HBM3E Capacity   | 256 GB       | 2 TB total    |
| Memory Bandwidth | 6.0 TB/s     | 48 TB/s total |
| FP16 Compute     | 1,307 TFLOPS | 10.5 PFLOPS   |
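The cluster column follows from multiplying the per-GPU figures by eight; a quick sketch of the arithmetic:

```python
# Per-GPU MI325X figures from the specifications table.
hbm_gb = 256
bandwidth_tb_s = 6.0
fp16_tflops = 1307
num_gpus = 8

total_hbm_tb = hbm_gb * num_gpus / 1024            # 2.0 TB
total_bandwidth_tb_s = bandwidth_tb_s * num_gpus   # 48.0 TB/s
total_fp16_pflops = fp16_tflops * num_gpus / 1000  # 10.456 -> ~10.5 PFLOPS

print(total_hbm_tb, total_bandwidth_tb_s, round(total_fp16_pflops, 1))
```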
## What This Enables (Verified in Our Testing)

| Capability                | MI325X (8x)                | Why It Matters              |
|---------------------------|----------------------------|-----------------------------|
| 1T model (Kimi-K2.5)      | Fits with INT4 QAT         | Largest open MoE model      |
| 685B model in FP8         | Uses 83 GB of 2 TB         | 96% headroom for KV cache   |
| 1,000 concurrent requests | 100% success rate          | Massive batch capacity      |
| No KV offloading needed   | Fits entirely in HBM       | Lower latency, simpler config |
| BF16 for 235B models      | Fits without quantization  | Maximum accuracy when needed |
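The KV-cache headroom can be put in perspective with a back-of-the-envelope sizing. The transformer dimensions below (layer count, KV heads, head size) are hypothetical and not taken from any model in the table:

```python
# Hypothetical transformer dimensions (not a specific model from the table).
n_layers = 60
n_kv_heads = 8      # grouped-query attention
head_dim = 128
kv_dtype_bytes = 1  # FP8 KV cache

# K and V each store n_kv_heads * head_dim values per layer per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_dtype_bytes
print(kv_bytes_per_token)  # 122880 bytes, i.e. 120 KiB per token

# Tokens of KV cache that fit in 1 TB of free HBM across the cluster.
free_hbm_bytes = 1 * 1024**4
max_tokens = free_hbm_bytes // kv_bytes_per_token
print(max_tokens)  # roughly 8.9M tokens of aggregate context
```

With per-token KV cost in the hundreds-of-KiB range, terabyte-scale headroom is what allows vLLM to hold thousands of long sequences in HBM without offloading.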
> **Note:** LLM decode throughput is typically bound by memory bandwidth rather than compute: generating each token requires streaming model weights and KV cache out of HBM. The MI325X's 6.0 TB/s bandwidth therefore translates directly into higher token throughput, especially at large batch sizes where KV-cache traffic dominates memory access.
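A first-order roofline estimate makes this concrete: at batch size 1, every weight byte must be read once per decoded token, so token rate is bounded by bandwidth divided by weight bytes. The sketch below reuses the 685B FP8 figure and the 48 TB/s aggregate bandwidth from above; it ignores KV reads, kernel overlap, and interconnect cost, so it is a rough upper bound, not a benchmark:

```python
# First-order, bandwidth-bound decode estimate (ignores KV-cache reads,
# overlap, and interconnect traffic; an upper bound, not a measurement).
weight_bytes = 685e9   # 685B parameters at 1 byte/param (FP8)
aggregate_bw = 48e12   # 48 TB/s across 8 GPUs

max_tokens_per_s = aggregate_bw / weight_bytes
print(round(max_tokens_per_s))  # ~70 tokens/s per sequence at batch 1
```

Batching amortizes the weight reads across many sequences, which is why large-batch serving recovers far higher aggregate throughput than this single-sequence bound.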