Kimi-K2.5 (1T) Stress Testing

Updated on 11 March 2026

Comprehensive stress benchmarks for Kimi-K2.5 (1 trillion total parameters, 32B active per token) on 8x AMD Instinct MI325X GPUs, run in thorough test mode (3x multiplier).


Concurrency Scaling

Kimi-K2.5 Scaling

Scaling Results

| Concurrency | Throughput | Output tok/s | p99 Latency | Status |
|-------------|------------|--------------|-------------|----------|
| 5 | 430 tok/s | 51 | 19.77s | DEGRADED |
| 10 | 670 tok/s | 79 | 25.40s | DEGRADED |
| 25 | 896 tok/s | 106 | 47.56s | DEGRADED |
| 50 | 1,632 tok/s | 193 | 52.02s | DEGRADED |
| 75 | 2,213 tok/s | 262 | 57.59s | DEGRADED |
| 100 | 2,656 tok/s | 314 | 63.86s | DEGRADED |
| 150 | 3,612 tok/s | 427 | 70.68s | DEGRADED |
| 200 | 3,754 tok/s | 444 | 74.89s | DEGRADED |

Observations:

  • Throughput scales steadily from 430 tok/s at 5 concurrent to a peak of 3,754 tok/s at 200 concurrent, with gains tapering beyond 150
  • MoE architecture (32B active parameters per token) enables efficient batching
  • DEGRADED status indicates p99 latency >2x the low-load baseline, which is expected under concurrent load
  • 100% success rate maintained across all concurrency levels
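As a rough sketch of how a row in the table above can be derived, the snippet below computes throughput, nearest-rank p99 latency, and the DEGRADED flag from per-request results. The `summarize` helper, its inputs, and the demo numbers are illustrative, not the actual test harness.

```python
# Hypothetical post-processing of one concurrency sweep step: given each
# request's (latency_s, output_tokens) and the wall-clock duration of the
# step, derive the metrics reported above.
import math

def summarize(results, wall_time_s, baseline_p99_s):
    """results: list of (latency_s, output_tokens) tuples."""
    latencies = sorted(r[0] for r in results)
    total_tokens = sum(r[1] for r in results)
    # Nearest-rank p99: smallest latency >= 99% of samples.
    idx = max(0, math.ceil(0.99 * len(latencies)) - 1)
    p99 = latencies[idx]
    return {
        "throughput_tok_s": total_tokens / wall_time_s,
        "p99_s": p99,
        # DEGRADED per the table above: p99 more than 2x the low-load baseline.
        "status": "DEGRADED" if p99 > 2 * baseline_p99_s else "OK",
    }

# 100 requests (one slow tail) finishing within 60 s of wall time.
demo = [(55.0, 300)] * 99 + [(63.9, 300)]
print(summarize(demo, 60.0, baseline_p99_s=20.0))
```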

Stress Tests

Kimi-K2.5 Stress

Stress Test Results

| Test | Type | Throughput | Output tok/s | p99 Latency | Status |
|------|------|------------|--------------|-------------|--------|
| long_output | text | 200 tok/s | 169 | 137.51s | OK |
| long_context | text | 1,253 tok/s | 96 | 50.12s | OK |
| very_long_context_8k | text | 1,834 tok/s | 73 | 31.25s | OK |
| multi_image_3 | multi-image | 1,475 tok/s | 95 | 77.16s | OK |
| multi_image_5 | multi-image | 1,100 tok/s | 71 | 81.04s | OK |
| high_conc_vision | vision | 1,793 tok/s | 301 | 100.77s | OK |
| sustained_vision | vision | 836 tok/s | 177 | 114.56s | OK |

Key findings:

  • Long output generation (500 tokens): 200 tok/s total, 169 output tok/s
  • Long context (4K tokens): 1,253 tok/s with 50.1s p99 latency
  • Very long context (8K tokens): 1,834 tok/s with 31.3s p99 latency
  • Multi-image (3 images): 1,475 tok/s with 77.2s p99 latency
  • Multi-image (5 images): 1,100 tok/s with 81.0s p99 latency
  • High concurrency vision (100 concurrent): 1,793 tok/s, 301 output tok/s
  • Sustained vision (50 concurrent, 450 requests): 836 tok/s over 17 minutes
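For reference, the multi_image_* tests above map onto chat requests with several image parts in the OpenAI-compatible API that `vllm serve` exposes. A minimal sketch of that payload shape, with placeholder URLs and prompt text:

```python
# Build an OpenAI-compatible chat payload carrying N images plus a text
# prompt. URLs and prompt are placeholders; only the shape matters here.
import json

def multi_image_request(model, image_urls, prompt):
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": prompt})
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = multi_image_request(
    "moonshotai/Kimi-K2.5",
    ["http://example.com/%d.png" % i for i in range(3)],  # the multi_image_3 case
    "Describe the differences between these images.",
)
print(json.dumps(payload, indent=2))
# POST the payload to the server's /v1/chat/completions endpoint.
```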

Saturation Testing

Kimi-K2.5 Saturation Testing

Extreme Load Results

| Concurrency | Throughput | Success Rate | p99 Latency | Status |
|-------------|------------|--------------|-------------|-----------|
| 150 | 3,628 tok/s | 100% | 69.74s | OK |
| 200 | 4,528 tok/s | 100% | 74.44s | OK |
| 300 | 5,820 tok/s | 100% | 86.81s | OK |
| 500 | 7,327 tok/s | 100% | 103.34s | OK |
| 750 | 7,304 tok/s | 100% | 103.66s | SATURATED |
| 1000 | 7,309 tok/s | 100% | 103.64s | SATURATED |

Observations:

  • Peak throughput of 7,327 tok/s achieved at 500 concurrent
  • Saturation begins at 750 concurrent requests
  • 100% success rate maintained even under extreme load (1,000 concurrent)
  • Consistent ~7,300 tok/s across 500-1,000 concurrent
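One way to read the table above programmatically: mark a concurrency level SATURATED once raising concurrency no longer buys meaningful throughput. The 1% gain threshold and the `label_saturation` helper are illustrative choices, not the harness's actual rule; the sweep data is copied from the Extreme Load table.

```python
# Label each sweep point OK or SATURATED: a point saturates when its
# throughput is no more than ~1% above the best level seen so far.
def label_saturation(sweep, gain_threshold=0.01):
    """sweep: list of (concurrency, tok_s); returns list of (concurrency, status)."""
    labels, best = [], 0.0
    for conc, tok_s in sweep:
        status = "SATURATED" if best and tok_s < best * (1 + gain_threshold) else "OK"
        best = max(best, tok_s)
        labels.append((conc, status))
    return labels

sweep = [(150, 3628), (200, 4528), (300, 5820), (500, 7327), (750, 7304), (1000, 7309)]
print(label_saturation(sweep))
```

Applied to the measured sweep, this reproduces the table's labels: OK through 500 concurrent, SATURATED at 750 and 1,000.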

Recommendations

| Use Case | Concurrency | Expected Throughput |
|----------|-------------|---------------------|
| Low latency | 1–10 | 430–670 tok/s |
| Balanced | 50–100 | 1,600–2,650 tok/s |
| High throughput | 150–500 | 3,600–7,300 tok/s |
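To turn the recommendations into a concrete sizing rule, the lookup below picks the highest measured concurrency whose p99 stays within a latency budget. The helper name is illustrative; the data is copied from the Concurrency Scaling table.

```python
# (concurrency, p99_s) pairs from the Concurrency Scaling table above.
SCALING = [(5, 19.77), (10, 25.40), (25, 47.56), (50, 52.02),
           (75, 57.59), (100, 63.86), (150, 70.68), (200, 74.89)]

def max_concurrency_for_p99(budget_s):
    """Highest measured concurrency whose p99 latency fits the budget."""
    ok = [conc for conc, p99 in SCALING if p99 <= budget_s]
    return max(ok) if ok else None

print(max_concurrency_for_p99(60.0))  # highest level with p99 <= 60 s
```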

Test Configuration

| Parameter | Value |
|-----------|-------|
| Model | moonshotai/Kimi-K2.5 |
| Test Mode | thorough (3x multiplier) |
| Timestamp | 20260203_165706 |
| Vision Model | Yes (MoonViT) |

Launch Command

```bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --group-add video \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --cap-add=CAP_SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --block-size 1 \
  --mm-encoder-tp-mode data
```
Warning: Critical Settings

  • VLLM_ROCM_USE_AITER=0 - AITER disabled due to MLA compatibility issues
  • --tensor-parallel-size 4 - required by the MLA attention head distribution (64/4 = 16 heads per GPU)
  • --block-size 1 - required for the MLA architecture
  • VLLM_USE_TRITON_FLASH_ATTN=0 - required for the MoonViT vision encoder
  • --mm-encoder-tp-mode data - runs the vision encoder data-parallel

Test Environment

| Specification | Value |
|---------------|-------|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | nightly (rocm/vllm-dev:nightly) |
| Tensor Parallel | 4 (required by MLA attention head distribution) |

Comparison with Other Models

| Model | Total Params | Active Params | Peak Throughput | Saturation Point |
|-------|--------------|---------------|-----------------|------------------|
| Kimi-K2.5 | 1T | 32B | 7,327 tok/s | 750 concurrent |
| Qwen3-VL-235B | 235B | 22B | 47,873 tok/s | 750 concurrent |
| DeepSeek V3.2 | 685B | 37B | 7,266 tok/s | 200 concurrent |
| Llama-3.1-405B | 405B | 405B | 6,464 tok/s | 300 concurrent |
Note: Architecture Notes

  • Kimi-K2.5 uses MLA (Multi-head Latent Attention), like DeepSeek V3.2, which is memory-efficient but scales differently than GQA-based models such as Qwen3-VL. The TP=4 requirement (vs TP=8 for the other models) comes from the MLA attention head distribution (64/4 = 16 heads per GPU).
