Deploy Kimi-K2.5 (1 trillion parameters) on AMD Instinct GPUs.
| Property | Value |
|---|---|
| Model ID | moonshotai/Kimi-K2.5 |
| Architecture | MoE with MLA (384 experts, 8 selected per token) |
| Total Parameters | 1T (1 trillion) |
| Active Parameters | 32B per token |
| Context Length | 256K tokens |
| Vision | MoonViT encoder (400M params) |
| Download Size | ~400 GB (compressed-tensors) |
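The sparse activation pattern in the table (384 experts, 8 selected per token) can be illustrated with a toy top-k router. This is a hypothetical sketch of the general MoE routing technique, not the model's actual routing code; the dimensions and function names here are illustrative.

```python
# Toy top-k expert routing, as used in MoE layers like Kimi-K2.5's
# (384 experts, 8 selected per token). Illustrative only.
import numpy as np

NUM_EXPERTS = 384
TOP_K = 8

def route(hidden: np.ndarray, router_w: np.ndarray):
    """Return indices and normalized weights of the top-k experts for one token."""
    logits = hidden @ router_w                      # (num_experts,)
    topk = np.argsort(logits)[-TOP_K:]              # indices of the 8 best experts
    weights = np.exp(logits[topk] - logits[topk].max())  # stable softmax over top-k
    return topk, weights / weights.sum()

rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)
router_w = rng.standard_normal((64, NUM_EXPERTS))
experts, weights = route(hidden, router_w)
print(len(experts), round(float(weights.sum()), 6))  # 8 1.0
```

Only the selected experts' FFN weights run for each token, which is why the active parameter count (32B) is a small fraction of the 1T total.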
Kimi-K2.5 requires a nightly vLLM build (`rocm/vllm-dev:nightly`); the stable release does not yet support this model.
```bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --group-add video \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --cap-add=CAP_SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --trust-remote-code \
    --block-size 1
```
- `VLLM_ROCM_USE_AITER=0` - Disables AITER (MLA head-count incompatibility with TP=4)
- `--tensor-parallel-size 4` - Not 8! Required for MLA attention head distribution (64/4 = 16 heads per GPU)
- `--block-size 1` - Required for the MLA architecture
- `--trust-remote-code` - The model uses custom modeling code
- `VLLM_USE_TRITON_FLASH_ATTN=0` - Required for the vision encoder

Enable chat, tool calling, and reasoning mode:
```bash
docker run --rm \
  --name vllm-kimi-k25 \
  --ipc=host \
  --network=host \
  --group-add video \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --cap-add=CAP_SYS_ADMIN \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "VLLM_ROCM_USE_AITER=0" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  rocm/vllm-dev:nightly \
  vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --trust-remote-code \
    --block-size 1 \
    --mm-encoder-tp-mode data \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2
```
| Flag | Purpose |
|---|---|
| `--mm-encoder-tp-mode data` | Data-parallel mode for the MoonViT vision encoder |
| `--tool-call-parser kimi_k2` | Parses tool/function calls in Kimi format |
| `--reasoning-parser kimi_k2` | Extracts reasoning blocks from responses |
| Configuration | Total Memory | Per GPU (TP=4) |
|---|---|---|
| Compressed-tensors (default) | ~160 GB | ~40 GB |
| Metric | Value |
|---|---|
| Load Time | ~5-8 minutes (first run) |
MI325X (256GB) easily fits the model with TP=4, leaving significant room for KV cache.
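The memory budget can be sanity-checked with back-of-the-envelope arithmetic, using the figures from the tables above; actual usage varies with context length and KV cache settings.

```python
# Rough per-GPU memory budget for Kimi-K2.5 on MI325X with TP=4,
# using the figures from the tables above. Illustrative arithmetic only.
TOTAL_WEIGHTS_GB = 160   # compressed-tensors weights across all GPUs
TP_SIZE = 4
GPU_MEMORY_GB = 256      # MI325X HBM capacity

weights_per_gpu = TOTAL_WEIGHTS_GB / TP_SIZE
kv_cache_headroom = GPU_MEMORY_GB - weights_per_gpu

print(f"Weights per GPU: {weights_per_gpu:.0f} GB")               # 40 GB
print(f"Headroom for KV cache, activations: {kv_cache_headroom:.0f} GB")  # 216 GB
```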
| Concurrent | Throughput | p99 Latency | Status |
|---|---|---|---|
| 10 | 225 tok/s | 77.83s | OK |
| 50 | 583 tok/s | 149.39s | OK |
| 100 | 948 tok/s | 183.35s | OK |
| 200 | 950 tok/s | 182.96s | OK |
| 500 | 948 tok/s | 183.23s | OK |
Multi-run means (n=5).
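The saturation behavior in the table can be made explicit by computing relative throughput; the values below are copied from the table, and this is analysis only, not the benchmark harness itself.

```python
# Throughput vs. concurrency from the benchmark table above.
# Throughput scales up to ~100 concurrent requests, then plateaus
# around 948-950 tok/s.
results = {10: 225, 50: 583, 100: 948, 200: 950, 500: 948}

baseline = results[10]
for concurrency, tput in results.items():
    print(f"{concurrency:>3} concurrent: {tput} tok/s ({tput / baseline:.1f}x vs. 10)")

peak = max(results.values())
saturated = [c for c, t in results.items() if t >= 0.99 * peak]
print("Throughput saturates at concurrency >=", min(saturated))  # 100
```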
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
```
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What do you see in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }]
  }'
```
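For local images, the `image_url` can carry a base64 data URL instead of an HTTP URL. A sketch building such a payload; the message structure mirrors the curl request above, and the fake JPEG bytes are placeholders for a real file's contents.

```python
import base64
import json

def image_message(text: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a user message embedding an image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

payload = {
    "model": "moonshotai/Kimi-K2.5",
    # In practice: open("photo.jpg", "rb").read() instead of the fake bytes.
    "messages": [image_message("What do you see in this image?", b"\xff\xd8\xff\xe0fake")],
}
# POST this payload to http://localhost:8000/v1/chat/completions
print(json.dumps(payload)[:80])
```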
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }]
  }'
```
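When the server returns a tool call (parsed by `--tool-call-parser kimi_k2` into the standard OpenAI response shape), the client executes the function and sends the result back as a `tool` message. A hypothetical sketch: the response dict is mocked here, and `get_weather` is a stub standing in for a real implementation.

```python
import json

def get_weather(location: str) -> dict:
    """Stub for a real weather lookup."""
    return {"location": location, "temperature_c": 18, "condition": "fog"}

TOOLS = {"get_weather": get_weather}

# Mocked /v1/chat/completions response; a real client would receive this
# from the server after the tools request above.
mock_response = {
    "choices": [{"message": {
        "role": "assistant",
        "tool_calls": [{
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather",
                         "arguments": '{"location": "San Francisco"}'},
        }],
    }}]
}

tool_messages = []
for call in mock_response["choices"][0]["message"]["tool_calls"]:
    fn = TOOLS[call["function"]["name"]]
    args = json.loads(call["function"]["arguments"])   # arguments arrive as a JSON string
    result = fn(**args)
    # Append as a "tool" role message for the next request turn.
    tool_messages.append({"role": "tool",
                          "tool_call_id": call["id"],
                          "content": json.dumps(result)})

print(tool_messages[0]["content"])
```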
| Model | Total Params | Active Params | Architecture | Vision |
|---|---|---|---|---|
| Kimi-K2.5 | 1T | 32B | MoE + MLA | Yes (MoonViT) |
| DeepSeek V3.2 | 685B | ~37B | MoE + MLA | No |
| Qwen3-VL-235B | 235B | 22B | MoE + GQA | Yes |
Unlike other large MoE models, Kimi-K2.5 requires TP=4:

```
Error: AITER MLA requires 16 heads per GPU
```

The MLA architecture has 64 attention heads, which must be distributed as exactly 16 heads per GPU (64/4 = 16).
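The constraint can be checked with quick arithmetic; a sketch assuming the 64-head count stated above.

```python
# Heads-per-GPU for candidate tensor-parallel sizes, given Kimi-K2.5's
# 64 MLA attention heads. Only TP=4 yields the required 16 heads per GPU.
NUM_ATTENTION_HEADS = 64

def heads_per_gpu(tp_size: int) -> int:
    """Attention heads assigned to each GPU under tensor parallelism."""
    assert NUM_ATTENTION_HEADS % tp_size == 0, f"TP={tp_size} does not divide {NUM_ATTENTION_HEADS}"
    return NUM_ATTENTION_HEADS // tp_size

for tp in (2, 4, 8):
    print(f"TP={tp}: {heads_per_gpu(tp)} heads per GPU")
```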
Like DeepSeek V3.2, Kimi-K2.5 uses MLA, which is incompatible with vLLM's KV cache offloading. The 256 GB per MI325X provides sufficient capacity without offloading.
Do not use `--kv-cache-dtype fp8`:

```
Error: MLA doesn't support fp8 kv_cache_dtype
```

The model uses its own compressed-tensors quantization format.
Ensure `AITER_ENABLE_VSKIP=0` is set. If unset, it defaults to true, which causes crashes on MI300X/MI325X.
If you see attention head distribution errors, verify you're using `--tensor-parallel-size 4`, not 8.
Kimi-K2.5 requires the nightly vLLM build. Use `rocm/vllm-dev:nightly` instead of the stable release.
Ensure `VLLM_USE_TRITON_FLASH_ATTN=0` is set for the MoonViT encoder to work correctly.