Comprehensive stress testing of Llama-3.1-405B-Instruct on 8x AMD Instinct MI325X GPUs.

| Concurrency | Throughput | Output tok/s | p99 Latency |
|---|---|---|---|
| 5 | 423 tok/s | 153 | 6.15s |
| 10 | 851 tok/s | 322 | 6.23s |
| 25 | 1,953 tok/s | 738 | 6.80s |
| 50 | 3,428 tok/s | 1,296 | 7.75s |
| 75 | 4,565 tok/s | 1,726 | 8.71s |
| 100 | 5,254 tok/s | 1,986 | 10.08s |
| 150 | 6,927 tok/s | 2,619 | 11.45s |
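| 200 | 6,464 tok/s | 2,444 | 14.49s |

Observations: throughput rises with concurrency up to 150 (6,927 tok/s), then drops to 6,464 tok/s at 200 while p99 latency climbs from 11.45s to 14.49s.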
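
The scaling behavior in the concurrency table can be checked with a few lines of arithmetic. The sketch below is illustrative only: it copies the throughput column from that table and compares per-request throughput at each level against the concurrency-5 baseline. The names `RESULTS`, `scaling_efficiency`, and `peak_concurrency` are hypothetical and not part of the benchmark tooling.

```python
# Throughput (tok/s) by concurrency, copied from the concurrency table.
RESULTS = {5: 423, 10: 851, 25: 1953, 50: 3428,
           75: 4565, 100: 5254, 150: 6927, 200: 6464}


def scaling_efficiency(results):
    """Per-request throughput at each level, relative to the lowest level."""
    base_c = min(results)
    base = results[base_c] / base_c
    return {c: (t / c) / base for c, t in sorted(results.items())}


def peak_concurrency(results):
    """Concurrency level with the highest throughput."""
    return max(results, key=results.get)


if __name__ == "__main__":
    for c, eff in scaling_efficiency(RESULTS).items():
        print(f"concurrency {c:>3}: {eff:.0%} of baseline per-request throughput")
    print("peak at concurrency", peak_concurrency(RESULTS))
```

By this measure, per-request throughput at concurrency 150 is roughly 55% of the concurrency-5 baseline, and about 38% at concurrency 200.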

| Test Type | Concurrency | Throughput | Output tok/s | p99 Latency |
|---|---|---|---|---|
| Long Output (1000 tokens) text | 50 | 1,516 tok/s | 1,224 | 19.02s |
| Long Context (4K) text | 25 | 8,240 tok/s | 627 | 7.52s |
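| Very Long Context (8K) text | 12 | 6,794 tok/s | 268 | 7.22s |

Key findings: the 4K-context test reaches the highest throughput of the three (8,240 tok/s at concurrency 25), while the 1000-token output test has the highest p99 latency (19.02s).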

| Concurrency | Throughput | Output tok/s | p99 Latency | Status |
|---|---|---|---|---|
| 150 | 10,320 tok/s | 2,406 | 6.15s | OK |
| 200 | 11,519 tok/s | 2,685 | 7.35s | OK |
| 300 | 13,937 tok/s | 3,249 | 9.09s | OK |
| 500 | 15,944 tok/s | 3,673 | 11.78s | PEAK |
| 750 | 15,693 tok/s | 3,658 | 12.01s | SATURATED |
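| 1000 | 15,319 tok/s | 3,536 | 12.28s | SATURATED |

Observations: throughput peaks at 15,944 tok/s at concurrency 500; at 750 and 1000 it declines slightly (15,693 and 15,319 tok/s) while p99 latency keeps rising, from 11.78s to 12.28s.
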
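
One simple rule that reproduces the Status column is to label the highest-throughput row as the peak and every higher-concurrency row as saturated. The sketch below assumes that rule; the report does not state how the labels were actually assigned, and `ROWS` and `classify` are hypothetical names.

```python
# (concurrency, throughput in tok/s), copied from the high-concurrency table.
ROWS = [(150, 10320), (200, 11519), (300, 13937),
        (500, 15944), (750, 15693), (1000, 15319)]


def classify(rows):
    """Label each concurrency level relative to the throughput peak."""
    rows = sorted(rows)
    peak_c, _ = max(rows, key=lambda r: r[1])
    status = {}
    for c, _ in rows:
        if c < peak_c:
            status[c] = "OK"
        elif c == peak_c:
            status[c] = "PEAK"
        else:
            status[c] = "SATURATED"  # more load, no additional throughput
    return status


if __name__ == "__main__":
    for c, label in classify(ROWS).items():
        print(c, label)
```
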
| Use Case | Concurrency | Expected Throughput |
|---|---|---|
| Low latency | 5–10 | 400–850 tok/s |
| Balanced | 25–50 | 2,000–3,400 tok/s |
| High throughput | 100–150 | 5,200–6,900 tok/s |
| Maximum throughput | 500 | 15,944 tok/s |

| Parameter | Value |
|---|---|
| Model | meta-llama/Llama-3.1-405B-Instruct |
| Precision | FP8 |
| Tensor Parallelism | 8 |
| GPUs | 8x MI325X (256 GB each) |
| Total VRAM | 2 TB |
| Test Mode | Thorough (3x multiplier) |
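
A rough memory budget shows how much headroom this configuration leaves for the KV cache. The sketch below assumes 1 byte per parameter for FP8 weights and uses Llama 3.1 405B architecture values (126 layers, 8 KV heads, head dimension 128) taken from the published model config rather than from this report. It ignores activations, runtime overhead, and vLLM's memory reservation settings, so the results are upper-bound estimates only.

```python
PARAMS = 405e9           # parameter count implied by the model name
BYTES_PER_PARAM = 1      # FP8 weights
NUM_GPUS = 8             # tensor parallelism 8
VRAM_PER_GPU_GB = 256

# Assumed architecture values (not stated in this report).
NUM_LAYERS = 126
NUM_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_VALUE = 1   # FP8 KV cache


def weight_gb_per_gpu():
    """Weight memory per GPU if weights are split evenly across GPUs."""
    return PARAMS * BYTES_PER_PARAM / NUM_GPUS / 1e9


def kv_bytes_per_token():
    """Keys plus values for one token, summed over all layers and KV heads."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES_PER_VALUE


def max_cached_tokens():
    """Upper bound on tokens that fit in VRAM left over after the weights."""
    free_bytes = NUM_GPUS * VRAM_PER_GPU_GB * 1e9 - PARAMS * BYTES_PER_PARAM
    return free_bytes / kv_bytes_per_token()


if __name__ == "__main__":
    print(f"weights per GPU: {weight_gb_per_gpu():.1f} GB")
    print(f"KV cache per token: {kv_bytes_per_token() / 1024:.0f} KiB")
    print(f"upper bound on cached tokens: {max_cached_tokens():,.0f}")
```

Under these assumptions the weights take about 51 GB of each GPU's 256 GB, and each cached token costs about 252 KiB across the node.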

```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```

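
Once the server started by the `docker run` command is listening on port 8000, a fixed-concurrency load test can be sketched with the Python standard library. This is not the harness used to produce the tables above. It is a minimal illustration that assumes the server's OpenAI-compatible `/v1/completions` endpoint, and the prompt, `max_tokens`, concurrency, and request count are arbitrary placeholder values. `send_completion` and `run_load` are hypothetical helper names.

```python
import json
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # port published by -p 8000:8000
MODEL = "meta-llama/Llama-3.1-405B-Instruct"


def send_completion(prompt="Hello", max_tokens=64):
    """Send one completion request and return the number of generated tokens."""
    body = json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    ).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]


def run_load(send, concurrency, num_requests):
    """Call send() num_requests times with at most `concurrency` in flight."""

    def timed(_):
        start = time.perf_counter()
        tokens = send()
        return tokens, time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(num_requests)))
    wall = time.perf_counter() - wall_start

    latencies = sorted(lat for _, lat in results)
    p99 = latencies[math.ceil(0.99 * len(latencies)) - 1]  # nearest-rank p99
    total_tokens = sum(tok for tok, _ in results)
    return {
        "requests": len(results),
        "output_tok_per_s": total_tokens / wall,
        "p99_latency_s": p99,
    }


if __name__ == "__main__":
    print(run_load(send_completion, concurrency=25, num_requests=75))
```

Passing the request function in as `send` keeps the measurement logic independent of the HTTP call, so the same loop can be exercised against a stub before pointing it at a live server.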
| Specification | Value |
|---|---|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | 0.14.1 |