Kimi-K2.5 (1T) Stress Testing

Updated on 11 March 2026

Comprehensive stress benchmarks for Kimi-K2.5 (1 trillion total parameters, 32B active per token) on 8x AMD Instinct MI325X GPUs, run in thorough test mode (3x multiplier).


Concurrency Scaling

Kimi-K2.5 Scaling

Scaling Results

| Concurrency | Throughput | Output tok/s | p99 Latency | Status |
|-------------|------------|--------------|-------------|----------|
| 5 | 430 tok/s | 51 | 19.77s | DEGRADED |
| 10 | 670 tok/s | 79 | 25.40s | DEGRADED |
| 25 | 896 tok/s | 106 | 47.56s | DEGRADED |
| 50 | 1,632 tok/s | 193 | 52.02s | DEGRADED |
| 75 | 2,213 tok/s | 262 | 57.59s | DEGRADED |
| 100 | 2,656 tok/s | 314 | 63.86s | DEGRADED |
| 150 | 3,612 tok/s | 427 | 70.68s | DEGRADED |
| 200 | 3,754 tok/s | 444 | 74.89s | DEGRADED |

Observations:

  • Throughput scales steadily from 430 tok/s at 5 concurrent to a peak of 3,754 tok/s at 200 concurrent, with gains tapering beyond 150
  • MoE architecture (32B active parameters per token) enables efficient batching
  • DEGRADED status indicates p99 latency >2x the low-load baseline, which is expected under concurrent load
  • 100% success rate maintained across all concurrency levels
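As a rough sketch of how a row in the table above can be derived, the snippet below computes throughput, nearest-rank p99 latency, and the DEGRADED flag from per-request results. The `summarize` helper, its inputs, and the demo numbers are illustrative, not the actual test harness.

```python
# Hypothetical post-processing of one concurrency sweep step: given each
# request's (latency_s, output_tokens) and the wall-clock duration of the
# step, derive the metrics reported above.
import math

def summarize(results, wall_time_s, baseline_p99_s):
    """results: list of (latency_s, output_tokens) tuples."""
    latencies = sorted(r[0] for r in results)
    total_tokens = sum(r[1] for r in results)
    # Nearest-rank p99: smallest latency >= 99% of samples.
    idx = max(0, math.ceil(0.99 * len(latencies)) - 1)
    p99 = latencies[idx]
    return {
        "throughput_tok_s": total_tokens / wall_time_s,
        "p99_s": p99,
        # DEGRADED per the table above: p99 more than 2x the low-load baseline.
        "status": "DEGRADED" if p99 > 2 * baseline_p99_s else "OK",
    }

# 100 requests (one slow tail) finishing within 60 s of wall time.
demo = [(55.0, 300)] * 99 + [(63.9, 300)]
print(summarize(demo, 60.0, baseline_p99_s=20.0))
```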

Stress Tests

Kimi-K2.5 Stress

Stress Test Results

| Test | Type | Throughput | Output tok/s | p99 Latency | Status |
|------|------|------------|--------------|-------------|--------|
| long_output | text | 200 tok/s | 169 | 137.51s | OK |
| long_context | text | 1,253 tok/s | 96 | 50.12s | OK |
| very_long_context_8k | text | 1,834 tok/s | 73 | 31.25s | OK |
| multi_image_3 | multi-image | 1,475 tok/s | 95 | 77.16s | OK |
| multi_image_5 | multi-image | 1,100 tok/s | 71 | 81.04s | OK |
| high_conc_vision | vision | 1,793 tok/s | 301 | 100.77s | OK |
| sustained_vision | vision | 836 tok/s | 177 | 114.56s | OK |

Key findings:

  • Long output generation (500 tokens): 200 tok/s total, 169 output tok/s
  • Long context (4K tokens): 1,253 tok/s with 50.1s p99 latency
  • Very long context (8K tokens): 1,834 tok/s with 31.3s p99 latency
  • Multi-image (3 images): 1,475 tok/s with 77.2s p99 latency
  • Multi-image (5 images): 1,100 tok/s with 81.0s p99 latency
  • High concurrency vision (100 concurrent): 1,793 tok/s, 301 output tok/s
  • Sustained vision (50 concurrent, 450 requests): 836 tok/s over 17 minutes
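For reference, the multi_image_* tests above map onto chat requests with several image parts in the OpenAI-compatible API that `vllm serve` exposes. A minimal sketch of that payload shape, with placeholder URLs and prompt text:

```python
# Build an OpenAI-compatible chat payload carrying N images plus a text
# prompt. URLs and prompt are placeholders; only the shape matters here.
import json

def multi_image_request(model, image_urls, prompt):
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": prompt})
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = multi_image_request(
    "moonshotai/Kimi-K2.5",
    ["http://example.com/%d.png" % i for i in range(3)],  # the multi_image_3 case
    "Describe the differences between these images.",
)
print(json.dumps(payload, indent=2))
# POST the payload to the server's /v1/chat/completions endpoint.
```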

Saturation Testing

Kimi-K2.5 Saturation Testing

Extreme Load Results

| Concurrency | Throughput | Success Rate | p99 Latency | Status |
|-------------|------------|--------------|-------------|-----------|
| 150 | 3,628 tok/s | 100% | 69.74s | OK |
| 200 | 4,528 tok/s | 100% | 74.44s | OK |
| 300 | 5,820 tok/s | 100% | 86.81s | OK |
| 500 | 7,327 tok/s | 100% | 103.34s | OK |
| 750 | 7,304 tok/s | 100% | 103.66s | SATURATED |
| 1000 | 7,309 tok/s | 100% | 103.64s | SATURATED |

Observations:

  • Peak throughput of 7,327 tok/s achieved at 500 concurrent
  • Saturation begins at 750 concurrent requests
  • 100% success rate maintained even under extreme load (1,000 concurrent)
  • Consistent ~7,300 tok/s across 500-1,000 concurrent
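One way to read the table above programmatically: mark a concurrency level SATURATED once raising concurrency no longer buys meaningful throughput. The 1% gain threshold and the `label_saturation` helper are illustrative choices, not the harness's actual rule; the sweep data is copied from the Extreme Load table.

```python
# Label each sweep point OK or SATURATED: a point saturates when its
# throughput is no more than ~1% above the best level seen so far.
def label_saturation(sweep, gain_threshold=0.01):
    """sweep: list of (concurrency, tok_s); returns list of (concurrency, status)."""
    labels, best = [], 0.0
    for conc, tok_s in sweep:
        status = "SATURATED" if best and tok_s < best * (1 + gain_threshold) else "OK"
        best = max(best, tok_s)
        labels.append((conc, status))
    return labels

sweep = [(150, 3628), (200, 4528), (300, 5820), (500, 7327), (750, 7304), (1000, 7309)]
print(label_saturation(sweep))
```

Applied to the measured sweep, this reproduces the table's labels: OK through 500 concurrent, SATURATED at 750 and 1,000.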

Recommendations

| Use Case | Concurrency | Expected Throughput |
|----------|-------------|---------------------|
| Low latency | 1–10 | 430–670 tok/s |
| Balanced | 50–100 | 1,600–2,650 tok/s |
| High throughput | 150–500 | 3,600–7,300 tok/s |
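To turn the recommendations into a concrete sizing rule, the lookup below picks the highest measured concurrency whose p99 stays within a latency budget. The helper name is illustrative; the data is copied from the Concurrency Scaling table.

```python
# (concurrency, p99_s) pairs from the Concurrency Scaling table above.
SCALING = [(5, 19.77), (10, 25.40), (25, 47.56), (50, 52.02),
           (75, 57.59), (100, 63.86), (150, 70.68), (200, 74.89)]

def max_concurrency_for_p99(budget_s):
    """Highest measured concurrency whose p99 latency fits the budget."""
    ok = [conc for conc, p99 in SCALING if p99 <= budget_s]
    return max(ok) if ok else None

print(max_concurrency_for_p99(60.0))  # highest level with p99 <= 60 s
```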

Test Configuration

| Parameter | Value |
|-----------|-------|
| Model | moonshotai/Kimi-K2.5 |
| Test Mode | thorough (3x multiplier) |
| Timestamp | 20260203_165706 |
| Vision Model | Yes (MoonViT) |

Launch Command

```bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --group-add video \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --cap-add=CAP_SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --block-size 1 \
  --mm-encoder-tp-mode data
```
Warning: Critical Settings

  • VLLM_ROCM_USE_AITER=0 - AITER disabled due to MLA compatibility issues
  • --tensor-parallel-size 4 - required by the MLA attention head distribution (64/4 = 16 heads per GPU)
  • --block-size 1 - required for the MLA architecture
  • VLLM_USE_TRITON_FLASH_ATTN=0 - required for the MoonViT vision encoder
  • --mm-encoder-tp-mode data - runs the vision encoder data-parallel

Test Environment

| Specification | Value |
|---------------|-------|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | nightly (rocm/vllm-dev:nightly) |
| Tensor Parallel | 4 (required by MLA attention head distribution) |

Comparison with Other Models

| Model | Total Params | Active Params | Peak Throughput | Saturation Point |
|-------|--------------|---------------|-----------------|------------------|
| Kimi-K2.5 | 1T | 32B | 7,327 tok/s | 750 concurrent |
| Qwen3-VL-235B | 235B | 22B | 47,873 tok/s | 750 concurrent |
| DeepSeek V3.2 | 685B | 37B | 7,266 tok/s | 200 concurrent |
| Llama-3.1-405B | 405B | 405B | 6,464 tok/s | 300 concurrent |
Note: Architecture Notes

  • Kimi-K2.5 uses MLA (Multi-head Latent Attention), like DeepSeek V3.2, which is memory-efficient but scales differently than GQA-based models such as Qwen3-VL. The TP=4 requirement (vs TP=8 for the other models) comes from the MLA attention head distribution (64/4 = 16 heads per GPU).
