Comprehensive stress benchmarks for Kimi-K2.5 (1 trillion parameters, 32B active per token) on 8x AMD Instinct MI325X GPUs with thorough testing (3x multiplier).

| Concurrency | Throughput | Output tok/s | p99 Latency | Status |
|---|---|---|---|---|
| 5 | 430 tok/s | 51 | 19.77s | DEGRADED |
| 10 | 670 tok/s | 79 | 25.40s | DEGRADED |
| 25 | 896 tok/s | 106 | 47.56s | DEGRADED |
| 50 | 1,632 tok/s | 193 | 52.02s | DEGRADED |
| 75 | 2,213 tok/s | 262 | 57.59s | DEGRADED |
| 100 | 2,656 tok/s | 314 | 63.86s | DEGRADED |
| 150 | 3,612 tok/s | 427 | 70.68s | DEGRADED |
| 200 | 3,754 tok/s | 444 | 74.89s | DEGRADED |
Observations:

| Test Type | Throughput | Output tok/s | p99 Latency | Status |
|---|---|---|---|---|
| long_output text | 200 tok/s | 169 | 137.51s | OK |
| long_context text | 1,253 tok/s | 96 | 50.12s | OK |
| very_long_context_8k text | 1,834 tok/s | 73 | 31.25s | OK |
| multi_image_3 multi-image | 1,475 tok/s | 95 | 77.16s | OK |
| multi_image_5 multi-image | 1,100 tok/s | 71 | 81.04s | OK |
| high_conc_vision vision | 1,793 tok/s | 301 | 100.77s | OK |
| sustained_vision vision | 836 tok/s | 177 | 114.56s | OK |
Key findings:

| Concurrency | Throughput | Success Rate | p99 Latency | Status |
|---|---|---|---|---|
| 150 | 3,628 tok/s | 100% | 69.74s | OK |
| 200 | 4,528 tok/s | 100% | 74.44s | OK |
| 300 | 5,820 tok/s | 100% | 86.81s | OK |
| 500 | 7,327 tok/s | 100% | 103.34s | OK |
| 750 | 7,304 tok/s | 100% | 103.66s | SATURATED |
| 1000 | 7,309 tok/s | 100% | 103.64s | SATURATED |
Observations:
| Use Case | Concurrency | Expected Throughput |
|---|---|---|
| Low latency | 1–10 | 430–670 tok/s |
| Balanced | 50–100 | 1,600–2,650 tok/s |
| High throughput | 150–500 | 3,600–7,300 tok/s |
| Parameter | Value |
|---|---|
| Model | moonshotai/Kimi-K2.5 |
| Test Mode | thorough (3x multiplier) |
| Timestamp | 20260203_165706 |
| Vision Model | Yes (MoonViT) |
docker run --rm \
--name vllm-kimi-k25 \
--ipc=host \
--network=host \
--group-add video \
--group-add render \
--cap-add=SYS_PTRACE \
--cap-add=CAP_SYS_ADMIN \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "VLLM_ROCM_USE_AITER=0" \
--env "VLLM_USE_TRITON_FLASH_ATTN=0" \
rocm/vllm-dev:nightly \
vllm serve moonshotai/Kimi-K2.5 \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--trust-remote-code \
--block-size 1 \
--mm-encoder-tp-mode data
| Specification | Value |
|---|---|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | nightly (rocm/vllm-dev:nightly) |
| Tensor Parallel | 4 (required for AITER MLA) |
| Model | Total Params | Active Params | Peak Throughput | Saturation Point |
|---|---|---|---|---|
| Kimi-K2.5 | 1T | 32B | 7,327 tok/s | 750 concurrent |
| Qwen3-VL-235B | 235B | 22B | 47,873 tok/s | 750 concurrent |
| DeepSeek V3.2 | 685B | 37B | 7,266 tok/s | 200 concurrent |
| Llama-3.1-405B | 405B | 405B | 6,464 tok/s | 300 concurrent |