Kimi-K2.5 (1T)

Updated on 17 March, 2026

Deploy Kimi-K2.5 (1 trillion parameters) on AMD Instinct GPUs.


Model Overview

| Property | Value |
|---|---|
| Model ID | moonshotai/Kimi-K2.5 |
| Architecture | MoE with MLA (384 experts, 8 selected per token) |
| Total Parameters | 1T (1 trillion) |
| Active Parameters | 32B per token |
| Context Length | 256K tokens |
| Vision | MoonViT encoder (400M params) |
| Download Size | ~400 GB (compressed-tensors) |

Quick Start

Warning
Kimi-K2.5 requires vLLM nightly build (rocm/vllm-dev:nightly). The stable release does not yet support this model.
```bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --group-add video \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --cap-add=CAP_SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --block-size 1
```

Critical Configuration

Warning
These settings are mandatory for Kimi-K2.5 on AMD:
  • VLLM_ROCM_USE_AITER=0 - disables AITER (its MLA backend is incompatible with this model, even at TP=4)
  • --tensor-parallel-size 4 - not 8; required so the 64 MLA attention heads split evenly into 16 per GPU
  • --block-size 1 - required by the MLA architecture
  • --trust-remote-code - the model ships custom code
  • VLLM_USE_TRITON_FLASH_ATTN=0 - required for the vision encoder
Note
Why is AITER disabled? Unlike most models on this platform, Kimi-K2.5 runs with AITER off. The AITER MLA backend requires specific per-GPU head counts (16 or 128). With TP=8, Kimi-K2.5's 64 heads would yield 8 heads per GPU, which is unsupported; with TP=4 the count is 16 heads per GPU, but AITER still has compatibility issues with this model's MLA implementation.
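The head-count arithmetic from the note above can be sketched quickly (64 is the MLA head count stated for this model; 16 and 128 are the per-GPU counts AITER's MLA backend accepts):

```python
# Per-GPU attention head count for Kimi-K2.5's MLA layers at different
# tensor-parallel sizes, and whether AITER's MLA backend would accept it.
TOTAL_MLA_HEADS = 64          # Kimi-K2.5 attention heads (from the note above)
AITER_SUPPORTED = {16, 128}   # per-GPU head counts AITER MLA accepts

def heads_per_gpu(tp_size: int) -> int:
    """Evenly distribute MLA heads across tensor-parallel ranks."""
    assert TOTAL_MLA_HEADS % tp_size == 0, "heads must divide evenly across ranks"
    return TOTAL_MLA_HEADS // tp_size

for tp in (4, 8):
    h = heads_per_gpu(tp)
    print(f"TP={tp}: {h} heads/GPU, AITER-compatible count: {h in AITER_SUPPORTED}")
```

Even though TP=4 lands on an AITER-compatible head count, AITER stays disabled for this model per the note above.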

Full Features Configuration

Enable chat, tool calling, and reasoning mode:

```bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --group-add video \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --cap-add=CAP_SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --block-size 1 \
  --mm-encoder-tp-mode data \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2
```

Additional Flags Explained

| Flag | Purpose |
|---|---|
| --mm-encoder-tp-mode data | Data-parallel mode for the MoonViT vision encoder |
| --tool-call-parser kimi_k2 | Parses tool/function calls in Kimi format |
| --reasoning-parser kimi_k2 | Extracts reasoning blocks |

Memory Usage

| Configuration | Total Memory | Per GPU (TP=4) |
|---|---|---|
| Compressed-tensors (default) | ~160 GB | ~40 GB |

| Metric | Value |
|---|---|
| Load Time | ~5-8 minutes (first run) |

With 256 GB of HBM per GPU, the MI325X easily fits the model at TP=4, leaving significant room for KV cache.
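As a back-of-envelope check using the numbers from the table above (256 GB is the MI325X HBM capacity):

```python
# Rough per-GPU memory budget for Kimi-K2.5 at TP=4 on MI325X.
total_weights_gb = 160   # compressed-tensors weight footprint (table above)
tp_size = 4
hbm_per_gpu_gb = 256     # MI325X HBM capacity

weights_per_gpu = total_weights_gb / tp_size
headroom = hbm_per_gpu_gb - weights_per_gpu
print(f"~{weights_per_gpu:.0f} GB weights/GPU, ~{headroom:.0f} GB left for KV cache and activations")
```

This ignores activation and framework overhead, so the real headroom is somewhat smaller, but it shows why offloading is unnecessary here.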

Performance (MI325X Verified)

| Concurrent | Throughput | p99 Latency | Status |
|---|---|---|---|
| 10 | 225 tok/s | 77.83 s | OK |
| 50 | 583 tok/s | 149.39 s | OK |
| 100 | 948 tok/s | 183.35 s | OK |
| 200 | 950 tok/s | 182.96 s | OK |
| 500 | 948 tok/s | 183.23 s | OK |

Multi-run means (n=5).

Note
  • 100% success rate across all concurrency levels (no failures)
  • Peak throughput: 952 tok/s at 1,000 concurrent requests
  • Saturation point: ~100 concurrent requests
See Kimi-K2.5 Stress Testing for detailed benchmark results including saturation testing up to 1,000 concurrent requests.

Test Endpoints

Chat Completion

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
```
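The same request body can be built from Python with just the standard library; this sketch only constructs the JSON payload (send it with any HTTP client as a POST to the endpoint shown above, with Content-Type: application/json):

```python
import json

# JSON body for the chat completions endpoint, mirroring the curl example above.
payload = {
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
}
body = json.dumps(payload).encode("utf-8")
print(body.decode())
```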

Vision Request

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }]
  }'
```
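For local images, vLLM's OpenAI-compatible server also accepts base64 data URLs in the image_url field; a minimal sketch of building that content list (the image bytes below are a stand-in for a real file read):

```python
import base64
import json

# Stand-in for open("photo.jpg", "rb").read() — replace with a real image file.
image_bytes = b"\xff\xd8\xff\xe0fake-jpeg-bytes"
data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")

# Multimodal content list: one text part, one image part, as in the curl example.
content = [
    {"type": "text", "text": "What do you see in this image?"},
    {"type": "image_url", "image_url": {"url": data_url}},
]
payload = {
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": content}],
}
print(json.dumps(payload)[:80], "...")
```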

Tool Calling

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }]
  }'
```
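With --tool-call-parser kimi_k2 active, tool invocations come back in the standard OpenAI tool_calls shape. A sketch of extracting the call on the client side (the response dict below is a hand-written mock, not real server output):

```python
import json

# Mock of a chat completion response containing a tool call (illustrative only).
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"location": "San Francisco"}),
                },
            }],
        }
    }]
}

call = response["choices"][0]["message"]["tool_calls"][0]["function"]
args = json.loads(call["arguments"])  # arguments arrive as a JSON string
print(call["name"], args["location"])  # → get_weather San Francisco
```

Execute the named function with the parsed arguments, then append a "tool" role message with the result and call the endpoint again to get the final answer.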

Comparison with Similar Models

| Model | Total Params | Active Params | Architecture | Vision |
|---|---|---|---|---|
| Kimi-K2.5 | 1T | 32B | MoE + MLA | Yes (MoonViT) |
| DeepSeek V3.2 | 685B | ~37B | MoE + MLA | No |
| Qwen3-VL-235B | 235B | 22B | MoE + GQA | Yes |

Known Limitations

Tensor Parallel Size Must Be 4

Kimi-K2.5 requires exactly TP=4; other tensor-parallel sizes fail with:

```
Error: AITER MLA requires 16 heads per GPU
```

The MLA architecture has 64 attention heads that must be distributed evenly, giving 16 heads per GPU at TP=4.

KV Cache Offloading Not Supported

Like DeepSeek V3.2, Kimi-K2.5 uses MLA which is incompatible with vLLM's KV cache offloading. The 256GB per MI325X provides sufficient capacity without offloading.

FP8 KV Cache Not Supported

Do not use --kv-cache-dtype fp8:

```
Error: MLA doesn't support fp8 kv_cache_dtype
```

The model uses its own compressed-tensors quantization format.

Troubleshooting

Crash on Startup

Ensure AITER_ENABLE_VSKIP=0 is set. If unset, it defaults to true, which causes crashes on MI300X/MI325X.

TP Size Error

If you see attention head distribution errors, verify you're using --tensor-parallel-size 4, not 8.

Model Not Found

Kimi-K2.5 requires the nightly vLLM build. Use rocm/vllm-dev:nightly instead of the stable release.

Vision Requests Fail

Ensure VLLM_USE_TRITON_FLASH_ATTN=0 is set for the MoonViT encoder to work correctly.
