DeepSeek V3.2 (685B) Stress Testing

Updated on 11 March, 2026

Comprehensive stress testing of DeepSeek V3.2 (685B parameters) on 8x AMD Instinct MI325X GPUs.


Concurrency Scaling

Scaling Results

Concurrency Total Throughput Output tok/s p99 Latency
5 461 tok/s 106 8.19s
10 1,235 tok/s 162 8.12s
25 3,013 tok/s 477 8.72s
50 4,607 tok/s 802 11.34s
75 4,677 tok/s 737 14.03s
100 7,160 tok/s 981 13.75s
150 6,891 tok/s 1,053 14.82s
200 7,266 tok/s 1,045 13.95s

Observations:

  • Throughput scales strongly from 461 tok/s at 5 concurrent to a peak of 7,266 tok/s at 200 concurrent
  • Scaling is near-linear up to 100 concurrent, then flattens; 150 concurrent dips slightly to 6,891 tok/s
  • p99 latency rises moderately across the range, from ~8s to ~15s
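As a sanity check on the observations above, the scaling efficiency implied by the table can be computed directly (a minimal sketch; the numbers are the total-throughput column, and "ideal" assumes perfectly linear scaling from the 5-concurrent baseline):

```python
# Total throughput (tok/s) per concurrency level, from the table above.
results = {5: 461, 10: 1235, 25: 3013, 50: 4607, 75: 4677, 100: 7160, 150: 6891, 200: 7266}

base_conc = 5
base_tps = results[base_conc]
for conc, tps in sorted(results.items()):
    # Ideal linear scaling: baseline throughput multiplied by conc / base_conc.
    ideal = base_tps * conc / base_conc
    print(f"{conc:>4} concurrent: {tps:>6} tok/s ({tps / ideal:.0%} of linear)")
```

At 200 concurrent this works out to roughly 39% of linear, which is why the report characterizes 100–200 concurrent as the high-throughput plateau rather than a scaling regime.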

Stress Tests

Stress Test Results

Test Modality Concurrency Total Throughput Output tok/s p99 Latency
Long Output (500 tokens) text 10 281 tok/s 230 22.45s
Long Context (4K) text 5 4,274 tok/s 101 8.66s
Very Long Context (8K) text 5 3,372 tok/s 120 8.78s

Key findings:

  • Long output generation (500 tokens): 281 tok/s with 22.5s p99 latency
  • Long context (4K tokens): 4,274 tok/s total throughput
  • Very long context (8K tokens): 3,372 tok/s total throughput
  • All tests passed with 100% success rate
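The gap between total throughput and output tok/s in the long-context rows reflects prompt (prefill) processing. Assuming the total-throughput column counts prompt plus generated tokens (consistent with the "total throughput" wording above), the prefill share can be estimated:

```python
# Split total token throughput into prefill (prompt) and decode (output) shares,
# assuming the total column counts prompt + generated tokens.
tests = {
    "Long Context (4K)":      {"total": 4274, "output": 101},
    "Very Long Context (8K)": {"total": 3372, "output": 120},
}
for name, t in tests.items():
    prefill = t["total"] - t["output"]
    print(f"{name}: ~{prefill} tok/s prefill, {t['output']} tok/s decode "
          f"({prefill / t['total']:.0%} of tokens are prompt)")
```

Under that assumption, roughly 96–98% of the tokens moved in the long-context tests are prompt tokens, so these rows mostly measure prefill speed rather than generation speed.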

Saturation Testing

Extreme Load Results

Concurrency Total Throughput Output tok/s p99 Latency Status
150 8,355 tok/s 868 5.77s OK
200 10,864 tok/s 867 5.73s OK
300 12,719 tok/s 1,039 7.35s OK
500 15,343 tok/s 1,239 9.08s PEAK
750 13,218 tok/s 1,348 10.84s SATURATED
1000 14,148 tok/s 1,276 9.99s OK

Observations:

  • Peak throughput of 15,343 tok/s reached at 500 concurrent
  • Beyond 500 concurrent, throughput plateaus at roughly 13,200–14,200 tok/s, indicating saturation
  • 100% success rate maintained even under extreme load (1,000 concurrent)
  • The MLA + AITER combination sustains strong high-concurrency performance
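The peak and saturation point called out above can be located mechanically from the sweep (a small sketch over the table data; "saturated" here means the first concurrency level past the peak whose throughput falls below it):

```python
# Extreme-load sweep: (concurrency, total tok/s) pairs from the table above.
sweep = [(150, 8355), (200, 10864), (300, 12719), (500, 15343), (750, 13218), (1000, 14148)]

# Peak: the row with maximum throughput.
peak_conc, peak_tps = max(sweep, key=lambda row: row[1])
print(f"Peak: {peak_tps} tok/s at {peak_conc} concurrent")

# Saturation: the first level beyond the peak where throughput drops below it.
for conc, tps in sweep:
    if conc > peak_conc and tps < peak_tps:
        print(f"Saturated from {conc} concurrent onward ({tps} tok/s)")
        break
```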

Recommendations

Use Case Concurrency Expected Throughput
Low latency 5–10 460–1,200 tok/s
Balanced 25–50 3,000–4,600 tok/s
High throughput 100–200 7,100–7,300 tok/s
Maximum throughput 500 15,343 tok/s
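The table above can be encoded as a simple lookup when picking a concurrency target programmatically (the ranges and labels come from the table; the helper itself is illustrative):

```python
# Recommended concurrency ranges per use case, taken from the table above.
RECOMMENDATIONS = {
    "low latency":        (5, 10),
    "balanced":           (25, 50),
    "high throughput":    (100, 200),
    "maximum throughput": (500, 500),
}

def recommended_concurrency(use_case: str) -> range:
    """Return the inclusive concurrency range suggested for a use case."""
    lo, hi = RECOMMENDATIONS[use_case.lower()]
    return range(lo, hi + 1)

print(recommended_concurrency("balanced"))
```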

Test Configuration

Parameter Value
Model deepseek-ai/DeepSeek-V3.2
Precision FP8
Tensor Parallelism 8
GPUs 8x MI325X (256GB each)
Total VRAM 2 TB
Test Mode Thorough (3x multiplier)

Launch Command

bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8
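Once the container is up, vLLM serves an OpenAI-compatible API on port 8000. A minimal request sketch follows (the endpoint path is vLLM's standard chat-completions route; the prompt and max_tokens are illustrative):

```python
import json
import urllib.request

# Chat-completions payload for the vLLM OpenAI-compatible server launched above.
payload = {
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Summarize MLA in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```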

Test Environment

Specification Value
GPU 8x AMD Instinct MI325X
VRAM 256 GB HBM3E per GPU (2 TB total)
Architecture CDNA 3 (gfx942)
ROCm 6.4.2-120
vLLM 0.14.1
Quantization fp8
