Deploy Kimi-K2.5 (1 trillion parameters) on AMD Instinct GPUs.
| Property | Value |
|---|---|
| Model ID | moonshotai/Kimi-K2.5 |
| Architecture | MoE with MLA (384 experts, 8 selected per token) |
| Total Parameters | 1T (1 trillion) |
| Active Parameters | 32B per token |
| Context Length | 256K tokens |
| Vision | MoonViT encoder (400M params) |
| Download Size | ~400 GB (compressed-tensors) |
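The sparse activation pattern in the table (384 experts, 8 selected per token) can be illustrated with a toy top-k router. This is a hypothetical sketch of the general MoE routing technique, not the model's actual routing code; the dimensions and function names here are illustrative.

```python
# Toy top-k expert routing, as used in MoE layers like Kimi-K2.5's
# (384 experts, 8 selected per token). Illustrative only.
import numpy as np

NUM_EXPERTS = 384
TOP_K = 8

def route(hidden: np.ndarray, router_w: np.ndarray):
    """Return indices and normalized weights of the top-k experts for one token."""
    logits = hidden @ router_w                      # (num_experts,)
    topk = np.argsort(logits)[-TOP_K:]              # indices of the 8 best experts
    weights = np.exp(logits[topk] - logits[topk].max())  # stable softmax over top-k
    return topk, weights / weights.sum()

rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)
router_w = rng.standard_normal((64, NUM_EXPERTS))
experts, weights = route(hidden, router_w)
print(len(experts), round(float(weights.sum()), 6))  # 8 1.0
```

Only the selected experts' FFN weights run for each token, which is why the active parameter count (32B) is a small fraction of the 1T total.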
Kimi-K2.5 requires a nightly vLLM build (`rocm/vllm-dev:nightly`); the stable release does not yet support this model.
```bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --group-add video \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --cap-add=CAP_SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --trust-remote-code \
    --block-size 1
```
- `VLLM_ROCM_USE_AITER=0` - Disables AITER (MLA head-count incompatibility with TP=4)
- `--tensor-parallel-size 4` - Not 8! Required for MLA attention head distribution (64/4 = 16 heads per GPU)
- `--block-size 1` - Required for the MLA architecture
- `--trust-remote-code` - The model uses custom modeling code
- `VLLM_USE_TRITON_FLASH_ATTN=0` - Required for the vision encoder

Enable chat, tool calling, and reasoning mode:
```bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --group-add video \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --cap-add=CAP_SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --trust-remote-code \
    --block-size 1 \
    --mm-encoder-tp-mode data \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2
```
| Flag | Purpose |
|---|---|
| `--mm-encoder-tp-mode data` | Data-parallel mode for the MoonViT vision encoder |
| `--tool-call-parser kimi_k2` | Parses tool/function calls in Kimi format |
| `--reasoning-parser kimi_k2` | Extracts reasoning blocks from responses |
| Configuration | Total Memory | Per GPU (TP=4) |
|---|---|---|
| Compressed-tensors (default) | ~160 GB | ~40 GB |
| Metric | Value |
|---|---|
| Load Time | ~5-8 minutes (first run) |
MI325X (256GB) easily fits the model with TP=4, leaving significant room for KV cache.
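The memory budget can be sanity-checked with back-of-the-envelope arithmetic, using the figures from the tables above; actual usage varies with context length and KV cache settings.

```python
# Rough per-GPU memory budget for Kimi-K2.5 on MI325X with TP=4,
# using the figures from the tables above. Illustrative arithmetic only.
TOTAL_WEIGHTS_GB = 160   # compressed-tensors weights across all GPUs
TP_SIZE = 4
GPU_MEMORY_GB = 256      # MI325X HBM capacity

weights_per_gpu = TOTAL_WEIGHTS_GB / TP_SIZE
kv_cache_headroom = GPU_MEMORY_GB - weights_per_gpu

print(f"Weights per GPU: {weights_per_gpu:.0f} GB")               # 40 GB
print(f"Headroom for KV cache, activations: {kv_cache_headroom:.0f} GB")  # 216 GB
```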
| Concurrent | Throughput | p99 Latency | Status |
|---|---|---|---|
| 10 | 225 tok/s | 77.83s | OK |
| 50 | 583 tok/s | 149.39s | OK |
| 100 | 948 tok/s | 183.35s | OK |
| 200 | 950 tok/s | 182.96s | OK |
| 500 | 948 tok/s | 183.23s | OK |
Multi-run means (n=5).
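The saturation behavior in the table can be made explicit by computing relative throughput; the values below are copied from the table, and this is analysis only, not the benchmark harness itself.

```python
# Throughput vs. concurrency from the benchmark table above.
# Throughput scales up to ~100 concurrent requests, then plateaus
# around 948-950 tok/s.
results = {10: 225, 50: 583, 100: 948, 200: 950, 500: 948}

baseline = results[10]
for concurrency, tput in results.items():
    print(f"{concurrency:>3} concurrent: {tput} tok/s ({tput / baseline:.1f}x vs. 10)")

peak = max(results.values())
saturated = [c for c, t in results.items() if t >= 0.99 * peak]
print("Throughput saturates at concurrency >=", min(saturated))  # 100
```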
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
```
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }]
  }'
```
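For local images, the `image_url` can carry a base64 data URL instead of an HTTP URL. A sketch building such a payload; the message structure mirrors the curl request above, and the fake JPEG bytes are placeholders for a real file's contents.

```python
import base64
import json

def image_message(text: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a user message embedding an image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

payload = {
    "model": "moonshotai/Kimi-K2.5",
    # In practice: open("photo.jpg", "rb").read() instead of the fake bytes.
    "messages": [image_message("What do you see in this image?", b"\xff\xd8\xff\xe0fake")],
}
# POST this payload to http://localhost:8000/v1/chat/completions
print(json.dumps(payload)[:80])
```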
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }]
  }'
```
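When the server returns a tool call (parsed by `--tool-call-parser kimi_k2` into the standard OpenAI response shape), the client executes the function and sends the result back as a `tool` message. A hypothetical sketch: the response dict is mocked here, and `get_weather` is a stub standing in for a real implementation.

```python
import json

def get_weather(location: str) -> dict:
    """Stub for a real weather lookup."""
    return {"location": location, "temperature_c": 18, "condition": "fog"}

TOOLS = {"get_weather": get_weather}

# Mocked /v1/chat/completions response; a real client would receive this
# from the server after the tools request above.
mock_response = {
    "choices": [{"message": {
        "role": "assistant",
        "tool_calls": [{
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather",
                         "arguments": '{"location": "San Francisco"}'},
        }],
    }}]
}

tool_messages = []
for call in mock_response["choices"][0]["message"]["tool_calls"]:
    fn = TOOLS[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])   # arguments arrive as a JSON string
    result = fn(**args)
    # Append as a "tool" role message for the next request turn.
    tool_messages.append({"role": "tool",
                          "tool_call_id": call["id"],
                          "content": json.dumps(result)})

print(tool_messages[0]["content"])
```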
| Model | Total Params | Active Params | Architecture | Vision |
|---|---|---|---|---|
| Kimi-K2.5 | 1T | 32B | MoE + MLA | Yes (MoonViT) |
| DeepSeek V3.2 | 685B | ~37B | MoE + MLA | No |
| Qwen3-VL-235B | 235B | 22B | MoE + GQA | Yes |
Unlike other large MoE models, Kimi-K2.5 requires TP=4:

```
Error: AITER MLA requires 16 heads per GPU
```

The MLA architecture has 64 attention heads, which must be distributed as exactly 16 heads per GPU (64/4 = 16).
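The constraint can be checked with quick arithmetic; a sketch assuming the 64-head count stated above.

```python
# Heads-per-GPU for candidate tensor-parallel sizes, given Kimi-K2.5's
# 64 MLA attention heads. Only TP=4 yields the required 16 heads per GPU.
NUM_ATTENTION_HEADS = 64

def heads_per_gpu(tp_size: int) -> int:
    """Attention heads assigned to each GPU under tensor parallelism."""
    assert NUM_ATTENTION_HEADS % tp_size == 0, f"TP={tp_size} does not divide {NUM_ATTENTION_HEADS}"
    return NUM_ATTENTION_HEADS // tp_size

for tp in (2, 4, 8):
    print(f"TP={tp}: {heads_per_gpu(tp)} heads per GPU")
```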
Like DeepSeek V3.2, Kimi-K2.5 uses MLA, which is incompatible with vLLM's KV cache offloading. The 256 GB per MI325X provides sufficient capacity without offloading.
Do not use `--kv-cache-dtype fp8`:

```
Error: MLA doesn't support fp8 kv_cache_dtype
```

The model uses its own compressed-tensors quantization format.
Ensure `AITER_ENABLE_VSKIP=0` is set. If unset, it defaults to true, which causes crashes on MI300X/MI325X.
If you see attention head distribution errors, verify you're using `--tensor-parallel-size 4`, not 8.
Kimi-K2.5 requires the nightly vLLM build. Use `rocm/vllm-dev:nightly` instead of the stable release.
Ensure `VLLM_USE_TRITON_FLASH_ATTN=0` is set for the MoonViT encoder to work correctly.