Comprehensive stress testing of DeepSeek V3.2 (685B parameters) on 8x AMD Instinct MI325X GPUs.

| Concurrency | Throughput | Output tok/s | p99 Latency |
|---|---|---|---|
| 5 | 461 tok/s | 106 | 8.19s |
| 10 | 1,235 tok/s | 162 | 8.12s |
| 25 | 3,013 tok/s | 477 | 8.72s |
| 50 | 4,607 tok/s | 802 | 11.34s |
| 75 | 4,677 tok/s | 737 | 14.03s |
| 100 | 7,160 tok/s | 981 | 13.75s |
| 150 | 6,891 tok/s | 1,053 | 14.82s |
| 200 | 7,266 tok/s | 1,045 | 13.95s |
Observations:

| Test Type | Concurrency | Throughput | Output tok/s | p99 Latency |
|---|---|---|---|---|
| Long Output (500 tokens) text | 10 | 281 tok/s | 230 | 22.45s |
| Long Context (4K) text | 5 | 4,274 tok/s | 101 | 8.66s |
| Very Long Context (8K) text | 5 | 3,372 tok/s | 120 | 8.78s |
Key findings:

| Concurrency | Throughput | Output tok/s | p99 Latency | Status |
|---|---|---|---|---|
| 150 | 8,355 tok/s | 868 | 5.77s | OK |
| 200 | 10,864 tok/s | 867 | 5.73s | OK |
| 300 | 12,719 tok/s | 1,039 | 7.35s | OK |
| 500 | 15,343 tok/s | 1,239 | 9.08s | PEAK |
| 750 | 13,218 tok/s | 1,348 | 10.84s | SATURATED |
| 1000 | 14,148 tok/s | 1,276 | 9.99s | OK |
Observations:
| Use Case | Concurrency | Expected Throughput |
|---|---|---|
| Low latency | 5–10 | 460–1,200 tok/s |
| Balanced | 25–50 | 3,000–4,600 tok/s |
| High throughput | 100–200 | 7,100–7,300 tok/s |
| Maximum throughput | 500 | 15,343 tok/s |
| Parameter | Value |
|---|---|
| Model | deepseek-ai/DeepSeek-V3.2 |
| Precision | FP8 |
| Tensor Parallelism | 8 |
| GPUs | 8x MI325X (256GB each) |
| Total VRAM | 2 TB |
| Test Mode | Thorough (3x multiplier) |
docker run --rm \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
--env "VLLM_ROCM_USE_AITER=1" \
--env "AITER_ENABLE_VSKIP=0" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm:latest \
--model deepseek-ai/DeepSeek-V3.2 \
--tensor-parallel-size 8 \
--block-size 1 \
--quantization fp8
| Specification | Value |
|---|---|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | 0.14.1 |
| Quantization | fp8 |