Llama 3.1 (405B) Stress Testing

Updated on 11 March 2026

Comprehensive stress testing of Llama-3.1-405B-Instruct on 8x AMD Instinct MI325X GPUs.


Concurrency Scaling


Scaling Results

| Concurrency | Throughput | Output tok/s | p99 Latency |
|------------:|-----------:|-------------:|------------:|
| 5   | 423 tok/s   | 153   | 6.15s  |
| 10  | 851 tok/s   | 322   | 6.23s  |
| 25  | 1,953 tok/s | 738   | 6.80s  |
| 50  | 3,428 tok/s | 1,296 | 7.75s  |
| 75  | 4,565 tok/s | 1,726 | 8.71s  |
| 100 | 5,254 tok/s | 1,986 | 10.08s |
| 150 | 6,927 tok/s | 2,619 | 11.45s |
| 200 | 6,464 tok/s | 2,444 | 14.49s |

Observations:

  • Throughput scales well up to 150 concurrent requests
  • Performance peaks at 150 concurrent (6,927 tok/s) before slight degradation at 200
  • p99 latency increases from 6.15s to 14.49s across the scaling range
  • Dense architecture provides consistent, predictable performance
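
The scaling behavior above can be sanity-checked numerically. A minimal Python sketch (table values copied from this report) that computes scaling efficiency relative to the 5-concurrent baseline:

```python
# Scaling-table data from this report: (concurrency, total tok/s)
points = [(5, 423), (10, 851), (25, 1953), (50, 3428),
          (75, 4565), (100, 5254), (150, 6927), (200, 6464)]

base_conc, base_tput = points[0]
per_request_base = base_tput / base_conc  # 84.6 tok/s per request at low load

for conc, tput in points:
    # Efficiency: measured throughput vs. perfect linear scaling from the baseline
    efficiency = tput / (per_request_base * conc)
    print(f"{conc:4d} concurrent: {tput:6d} tok/s, {efficiency:5.1%} of linear")
```

The drop in efficiency past 150 concurrent mirrors the throughput regression visible in the table.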

Stress Tests


Stress Test Results

| Test Type | Modality | Concurrency | Throughput | Output tok/s | p99 Latency |
|-----------|----------|------------:|-----------:|-------------:|------------:|
| Long Output (1000 tokens)  | text | 50 | 1,516 tok/s | 1,224 | 19.02s |
| Long Context (4K)          | text | 25 | 8,240 tok/s | 627   | 7.52s  |
| Very Long Context (8K)     | text | 12 | 6,794 tok/s | 268   | 7.22s  |

Key findings:

  • Long-context prefill is efficient: 8,240 tok/s total throughput with 4K context
  • Very long context (8K): 6,794 tok/s total throughput
  • Long-output generation: 1,224 tok/s of output throughput at 50 concurrent
  • All tests passed with a 100% success rate
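
The gap between total and output throughput in these tests reflects prefill work. Assuming the Throughput column counts all processed tokens (prompt prefill plus generated) while Output tok/s counts generated tokens only, the implied prefill rate for the 4K-context test can be derived:

```python
# Rough decomposition of the 4K-context stress test (values from this report).
# Assumption: "Throughput" counts all processed tokens, "Output tok/s" only
# generated tokens, so their difference approximates prefill throughput.
total_tput = 8240    # tok/s, Long Context (4K) test
output_tput = 627    # tok/s, generated tokens only

prefill_tput = total_tput - output_tput
print(f"Implied prefill throughput: {prefill_tput} tok/s")  # → 7613 tok/s
```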

Saturation Testing


Extreme Load Results

| Concurrency | Throughput | Output tok/s | p99 Latency | Status |
|------------:|-----------:|-------------:|------------:|--------|
| 150  | 10,320 tok/s | 2,406 | 6.15s  | OK        |
| 200  | 11,519 tok/s | 2,685 | 7.35s  | OK        |
| 300  | 13,937 tok/s | 3,249 | 9.09s  | OK        |
| 500  | 15,944 tok/s | 3,673 | 11.78s | PEAK      |
| 750  | 15,693 tok/s | 3,658 | 12.01s | SATURATED |
| 1000 | 15,319 tok/s | 3,536 | 12.28s | SATURATED |

Observations:

  • Peak throughput of 15,944 tok/s achieved at 500 concurrent
  • Saturation begins at 750 concurrent (throughput plateau)
  • 100% success rate maintained even under extreme load
  • System remains stable at 1,000 concurrent requests
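
For context on the p99 column, here is a small sketch of how a p99 latency is computed from raw per-request samples using the nearest-rank method; the latencies below are synthetic stand-ins, not this report's measurements:

```python
import random

# Synthetic per-request latencies (seconds) standing in for real measurements;
# the benchmark's actual samples are not reproduced here.
random.seed(0)
latencies = sorted(random.uniform(5.0, 13.0) for _ in range(1000))

# p99: the value at or below which 99% of samples fall (1-based nearest rank).
rank = max(1, round(0.99 * len(latencies)))
p99 = latencies[rank - 1]
print(f"p99 latency: {p99:.2f}s")
```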

Recommendations

| Use Case | Concurrency | Expected Throughput |
|----------|------------:|--------------------:|
| Low latency        | 5–10    | 400–850 tok/s     |
| Balanced           | 25–50   | 2,000–3,400 tok/s |
| High throughput    | 100–150 | 5,200–6,900 tok/s |
| Maximum throughput | 500     | 15,944 tok/s      |
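
The tiers above can be encoded as a simple lookup. A hedged sketch (tier names and ranges are taken from this table; the `recommend` function is illustrative):

```python
# Recommendation tiers from this report: (concurrency range, total tok/s range)
TIERS = {
    "low_latency":     ((5, 10),     (400, 850)),
    "balanced":        ((25, 50),    (2000, 3400)),
    "high_throughput": ((100, 150),  (5200, 6900)),
    "max_throughput":  ((500, 500),  (15944, 15944)),
}

def recommend(use_case: str):
    """Return (concurrency range, expected total tok/s range) for a use case."""
    return TIERS[use_case]

conc, tput = recommend("balanced")
print(f"balanced: run {conc[0]}-{conc[1]} concurrent, expect {tput[0]}-{tput[1]} tok/s")
```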

Test Configuration

| Parameter | Value |
|-----------|-------|
| Model              | meta-llama/Llama-3.1-405B-Instruct |
| Precision          | FP8 |
| Tensor Parallelism | 8 |
| GPUs               | 8x MI325X (256 GB each) |
| Total VRAM         | 2 TB |
| Test Mode          | Thorough (3x multiplier) |
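
As a sanity check on this configuration, some rough weight-memory arithmetic, assuming roughly 1 byte per parameter at FP8 and ignoring activation and runtime overheads:

```python
params_b = 405        # model size, billions of parameters
bytes_per_param = 1   # FP8 ≈ 1 byte per weight (rough assumption)
n_gpus = 8            # tensor parallelism degree
gpu_vram_gb = 256     # MI325X HBM3E per GPU

weights_gb = params_b * bytes_per_param    # ~405 GB of weights total
per_gpu_gb = weights_gb / n_gpus           # weight shard per GPU with TP=8
kv_headroom_gb = gpu_vram_gb - per_gpu_gb  # rough room left for KV cache etc.
print(f"{per_gpu_gb:.1f} GB weights/GPU, ~{kv_headroom_gb:.1f} GB headroom/GPU")
```

The large per-GPU headroom is what allows the FP8 KV cache to sustain 1,000 concurrent requests at 32K max context.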

Launch Command

```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```
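
Once launched, the container serves the OpenAI-compatible API on port 8000. A minimal Python sketch of the request body you would POST to /v1/chat/completions (payload construction only; no live call is made, and the prompt is illustrative):

```python
import json

# Chat-completions request body for the vLLM OpenAI-compatible server.
# Endpoint assumed from the launch command above:
#   http://localhost:8000/v1/chat/completions
payload = {
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "messages": [
        {"role": "user", "content": "Summarize the MI325X in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)
print(body[:60] + "...")
```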

Test Environment

| Specification | Value |
|---------------|-------|
| GPU          | 8x AMD Instinct MI325X |
| VRAM         | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm         | 6.4.2-120 |
| vLLM         | 0.14.1 |
