Llama 3.1 (405B) Stress Testing

Updated on 11 March 2026

Comprehensive stress testing of Llama-3.1-405B-Instruct on 8x AMD Instinct MI325X GPUs.


Concurrency Scaling


Scaling Results

| Concurrency | Throughput | Output tok/s | p99 Latency |
|------------:|-----------:|-------------:|------------:|
| 5   | 423 tok/s   | 153   | 6.15s  |
| 10  | 851 tok/s   | 322   | 6.23s  |
| 25  | 1,953 tok/s | 738   | 6.80s  |
| 50  | 3,428 tok/s | 1,296 | 7.75s  |
| 75  | 4,565 tok/s | 1,726 | 8.71s  |
| 100 | 5,254 tok/s | 1,986 | 10.08s |
| 150 | 6,927 tok/s | 2,619 | 11.45s |
| 200 | 6,464 tok/s | 2,444 | 14.49s |

Observations:

  • Throughput scales well up to 150 concurrent requests
  • Performance peaks at 150 concurrent (6,927 tok/s) before slight degradation at 200
  • p99 latency increases from 6.15s to 14.49s across the scaling range
  • Dense architecture provides consistent, predictable performance
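
The scaling behavior above can be sanity-checked numerically. A minimal Python sketch (table values copied from this report) that computes scaling efficiency relative to the 5-concurrent baseline:

```python
# Scaling-table data from this report: (concurrency, total tok/s)
points = [(5, 423), (10, 851), (25, 1953), (50, 3428),
          (75, 4565), (100, 5254), (150, 6927), (200, 6464)]

base_conc, base_tput = points[0]
per_request_base = base_tput / base_conc  # 84.6 tok/s per request at low load

for conc, tput in points:
    # Efficiency: measured throughput vs. perfect linear scaling from the baseline
    efficiency = tput / (per_request_base * conc)
    print(f"{conc:4d} concurrent: {tput:6d} tok/s, {efficiency:5.1%} of linear")
```

The drop in efficiency past 150 concurrent mirrors the throughput regression visible in the table.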

Stress Tests


Stress Test Results

| Test Type | Modality | Concurrency | Throughput | Output tok/s | p99 Latency |
|-----------|----------|------------:|-----------:|-------------:|------------:|
| Long Output (1000 tokens)  | text | 50 | 1,516 tok/s | 1,224 | 19.02s |
| Long Context (4K)          | text | 25 | 8,240 tok/s | 627   | 7.52s  |
| Very Long Context (8K)     | text | 12 | 6,794 tok/s | 268   | 7.22s  |

Key findings:

  • Long-context prefill is efficient: 8,240 tok/s total throughput with 4K context
  • Very long context (8K): 6,794 tok/s total throughput
  • Long-output generation: 1,224 tok/s of output throughput at 50 concurrent
  • All tests passed with a 100% success rate
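
The gap between total and output throughput in these tests reflects prefill work. Assuming the Throughput column counts all processed tokens (prompt prefill plus generated) while Output tok/s counts generated tokens only, the implied prefill rate for the 4K-context test can be derived:

```python
# Rough decomposition of the 4K-context stress test (values from this report).
# Assumption: "Throughput" counts all processed tokens, "Output tok/s" only
# generated tokens, so their difference approximates prefill throughput.
total_tput = 8240    # tok/s, Long Context (4K) test
output_tput = 627    # tok/s, generated tokens only

prefill_tput = total_tput - output_tput
print(f"Implied prefill throughput: {prefill_tput} tok/s")  # → 7613 tok/s
```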

Saturation Testing


Extreme Load Results

| Concurrency | Throughput | Output tok/s | p99 Latency | Status |
|------------:|-----------:|-------------:|------------:|--------|
| 150  | 10,320 tok/s | 2,406 | 6.15s  | OK        |
| 200  | 11,519 tok/s | 2,685 | 7.35s  | OK        |
| 300  | 13,937 tok/s | 3,249 | 9.09s  | OK        |
| 500  | 15,944 tok/s | 3,673 | 11.78s | PEAK      |
| 750  | 15,693 tok/s | 3,658 | 12.01s | SATURATED |
| 1000 | 15,319 tok/s | 3,536 | 12.28s | SATURATED |

Observations:

  • Peak throughput of 15,944 tok/s achieved at 500 concurrent
  • Saturation begins at 750 concurrent (throughput plateau)
  • 100% success rate maintained even under extreme load
  • System remains stable at 1,000 concurrent requests
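
For context on the p99 column, here is a small sketch of how a p99 latency is computed from raw per-request samples using the nearest-rank method; the latencies below are synthetic stand-ins, not this report's measurements:

```python
import random

# Synthetic per-request latencies (seconds) standing in for real measurements;
# the benchmark's actual samples are not reproduced here.
random.seed(0)
latencies = sorted(random.uniform(5.0, 13.0) for _ in range(1000))

# p99: the value at or below which 99% of samples fall (1-based nearest rank).
rank = max(1, round(0.99 * len(latencies)))
p99 = latencies[rank - 1]
print(f"p99 latency: {p99:.2f}s")
```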

Recommendations

| Use Case | Concurrency | Expected Throughput |
|----------|------------:|--------------------:|
| Low latency        | 5–10    | 400–850 tok/s     |
| Balanced           | 25–50   | 2,000–3,400 tok/s |
| High throughput    | 100–150 | 5,200–6,900 tok/s |
| Maximum throughput | 500     | 15,944 tok/s      |
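
The tiers above can be encoded as a simple lookup. A hedged sketch (tier names and ranges are taken from this table; the `recommend` function is illustrative):

```python
# Recommendation tiers from this report: (concurrency range, total tok/s range)
TIERS = {
    "low_latency":     ((5, 10),     (400, 850)),
    "balanced":        ((25, 50),    (2000, 3400)),
    "high_throughput": ((100, 150),  (5200, 6900)),
    "max_throughput":  ((500, 500),  (15944, 15944)),
}

def recommend(use_case: str):
    """Return (concurrency range, expected total tok/s range) for a use case."""
    return TIERS[use_case]

conc, tput = recommend("balanced")
print(f"balanced: run {conc[0]}-{conc[1]} concurrent, expect {tput[0]}-{tput[1]} tok/s")
```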

Test Configuration

| Parameter | Value |
|-----------|-------|
| Model              | meta-llama/Llama-3.1-405B-Instruct |
| Precision          | FP8 |
| Tensor Parallelism | 8 |
| GPUs               | 8x MI325X (256 GB each) |
| Total VRAM         | 2 TB |
| Test Mode          | Thorough (3x multiplier) |
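
As a sanity check on this configuration, some rough weight-memory arithmetic, assuming roughly 1 byte per parameter at FP8 and ignoring activation and runtime overheads:

```python
params_b = 405        # model size, billions of parameters
bytes_per_param = 1   # FP8 ≈ 1 byte per weight (rough assumption)
n_gpus = 8            # tensor parallelism degree
gpu_vram_gb = 256     # MI325X HBM3E per GPU

weights_gb = params_b * bytes_per_param    # ~405 GB of weights total
per_gpu_gb = weights_gb / n_gpus           # weight shard per GPU with TP=8
kv_headroom_gb = gpu_vram_gb - per_gpu_gb  # rough room left for KV cache etc.
print(f"{per_gpu_gb:.1f} GB weights/GPU, ~{kv_headroom_gb:.1f} GB headroom/GPU")
```

The large per-GPU headroom is what allows the FP8 KV cache to sustain 1,000 concurrent requests at 32K max context.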

Launch Command

```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```
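
Once launched, the container serves the OpenAI-compatible API on port 8000. A minimal Python sketch of the request body you would POST to /v1/chat/completions (payload construction only; no live call is made, and the prompt is illustrative):

```python
import json

# Chat-completions request body for the vLLM OpenAI-compatible server.
# Endpoint assumed from the launch command above:
#   http://localhost:8000/v1/chat/completions
payload = {
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "messages": [
        {"role": "user", "content": "Summarize the MI325X in one sentence."},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)
print(body[:60] + "...")
```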

Test Environment

| Specification | Value |
|---------------|-------|
| GPU          | 8x AMD Instinct MI325X |
| VRAM         | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm         | 6.4.2-120 |
| vLLM         | 0.14.1 |
