DeepSeek V3.2 (685B) Stress Testing

Updated on 11 March, 2026

Comprehensive stress testing of DeepSeek V3.2 (685B parameters) on 8x AMD Instinct MI325X GPUs.


Concurrency Scaling

Scaling Results

Concurrency Total Throughput Output tok/s p99 Latency
5 461 tok/s 106 8.19s
10 1,235 tok/s 162 8.12s
25 3,013 tok/s 477 8.72s
50 4,607 tok/s 802 11.34s
75 4,677 tok/s 737 14.03s
100 7,160 tok/s 981 13.75s
150 6,891 tok/s 1,053 14.82s
200 7,266 tok/s 1,045 13.95s

Observations:

  • Throughput scales strongly from 461 tok/s at 5 concurrent to a peak of 7,266 tok/s at 200 concurrent
  • Scaling is near-linear up to 100 concurrent, then flattens; 150 concurrent dips slightly to 6,891 tok/s
  • p99 latency rises moderately across the range, from ~8s to ~15s
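As a sanity check on the observations above, the scaling efficiency implied by the table can be computed directly (a minimal sketch; the numbers are the total-throughput column, and "ideal" assumes perfectly linear scaling from the 5-concurrent baseline):

```python
# Total throughput (tok/s) per concurrency level, from the table above.
results = {5: 461, 10: 1235, 25: 3013, 50: 4607, 75: 4677, 100: 7160, 150: 6891, 200: 7266}

base_conc = 5
base_tps = results[base_conc]
for conc, tps in sorted(results.items()):
    # Ideal linear scaling: baseline throughput multiplied by conc / base_conc.
    ideal = base_tps * conc / base_conc
    print(f"{conc:>4} concurrent: {tps:>6} tok/s ({tps / ideal:.0%} of linear)")
```

At 200 concurrent this works out to roughly 39% of linear, which is why the report characterizes 100–200 concurrent as the high-throughput plateau rather than a scaling regime.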

Stress Tests

Stress Test Results

Test Modality Concurrency Total Throughput Output tok/s p99 Latency
Long Output (500 tokens) text 10 281 tok/s 230 22.45s
Long Context (4K) text 5 4,274 tok/s 101 8.66s
Very Long Context (8K) text 5 3,372 tok/s 120 8.78s

Key findings:

  • Long output generation (500 tokens): 281 tok/s with 22.5s p99 latency
  • Long context (4K tokens): 4,274 tok/s total throughput
  • Very long context (8K tokens): 3,372 tok/s total throughput
  • All tests passed with 100% success rate
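The gap between total throughput and output tok/s in the long-context rows reflects prompt (prefill) processing. Assuming the total-throughput column counts prompt plus generated tokens (consistent with the "total throughput" wording above), the prefill share can be estimated:

```python
# Split total token throughput into prefill (prompt) and decode (output) shares,
# assuming the total column counts prompt + generated tokens.
tests = {
    "Long Context (4K)":      {"total": 4274, "output": 101},
    "Very Long Context (8K)": {"total": 3372, "output": 120},
}
for name, t in tests.items():
    prefill = t["total"] - t["output"]
    print(f"{name}: ~{prefill} tok/s prefill, {t['output']} tok/s decode "
          f"({prefill / t['total']:.0%} of tokens are prompt)")
```

Under that assumption, roughly 96–98% of the tokens moved in the long-context tests are prompt tokens, so these rows mostly measure prefill speed rather than generation speed.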

Saturation Testing

Extreme Load Results

Concurrency Total Throughput Output tok/s p99 Latency Status
150 8,355 tok/s 868 5.77s OK
200 10,864 tok/s 867 5.73s OK
300 12,719 tok/s 1,039 7.35s OK
500 15,343 tok/s 1,239 9.08s PEAK
750 13,218 tok/s 1,348 10.84s SATURATED
1000 14,148 tok/s 1,276 9.99s OK

Observations:

  • Peak throughput of 15,343 tok/s reached at 500 concurrent
  • Beyond 500 concurrent, throughput plateaus at roughly 13,200–14,200 tok/s, indicating saturation
  • 100% success rate maintained even under extreme load (1,000 concurrent)
  • The MLA + AITER combination sustains strong high-concurrency performance
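The peak and saturation point called out above can be located mechanically from the sweep (a small sketch over the table data; "saturated" here means the first concurrency level past the peak whose throughput falls below it):

```python
# Extreme-load sweep: (concurrency, total tok/s) pairs from the table above.
sweep = [(150, 8355), (200, 10864), (300, 12719), (500, 15343), (750, 13218), (1000, 14148)]

# Peak: the row with maximum throughput.
peak_conc, peak_tps = max(sweep, key=lambda row: row[1])
print(f"Peak: {peak_tps} tok/s at {peak_conc} concurrent")

# Saturation: the first level beyond the peak where throughput drops below it.
for conc, tps in sweep:
    if conc > peak_conc and tps < peak_tps:
        print(f"Saturated from {conc} concurrent onward ({tps} tok/s)")
        break
```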

Recommendations

Use Case Concurrency Expected Throughput
Low latency 5–10 460–1,200 tok/s
Balanced 25–50 3,000–4,600 tok/s
High throughput 100–200 7,100–7,300 tok/s
Maximum throughput 500 15,343 tok/s
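The table above can be encoded as a simple lookup when picking a concurrency target programmatically (the ranges and labels come from the table; the helper itself is illustrative):

```python
# Recommended concurrency ranges per use case, taken from the table above.
RECOMMENDATIONS = {
    "low latency":        (5, 10),
    "balanced":           (25, 50),
    "high throughput":    (100, 200),
    "maximum throughput": (500, 500),
}

def recommended_concurrency(use_case: str) -> range:
    """Return the inclusive concurrency range suggested for a use case."""
    lo, hi = RECOMMENDATIONS[use_case.lower()]
    return range(lo, hi + 1)

print(recommended_concurrency("balanced"))
```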

Test Configuration

Parameter Value
Model deepseek-ai/DeepSeek-V3.2
Precision FP8
Tensor Parallelism 8
GPUs 8x MI325X (256GB each)
Total VRAM 2 TB
Test Mode Thorough (3x multiplier)

Launch Command

bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8
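Once the container is up, vLLM serves an OpenAI-compatible API on port 8000. A minimal request sketch follows (the endpoint path is vLLM's standard chat-completions route; the prompt and max_tokens are illustrative):

```python
import json
import urllib.request

# Chat-completions payload for the vLLM OpenAI-compatible server launched above.
payload = {
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Summarize MLA in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```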

Test Environment

Specification Value
GPU 8x AMD Instinct MI325X
VRAM 256 GB HBM3E per GPU (2 TB total)
Architecture CDNA 3 (gfx942)
ROCm 6.4.2-120
vLLM 0.14.1
Quantization fp8
