
MiMo V2.5

NVIDIA
Xiaomi MiMo V2.5 is a native omnimodal Mixture-of-Experts model designed for unified multimodal reasoning across text, image, video, and audio. It features 310B total parameters with 15B active, using a 48-layer architecture with 4,096 hidden size and hybrid Sliding Window and Global Attention for efficient long-context processing. The model integrates a 729M-parameter vision encoder and a dedicated audio encoder, enabling rich perception capabilities. Supporting up to a 1M token context, it excels in long-horizon reasoning and advanced agentic workflows.
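The headline numbers above can be collected into a small spec table for quick reference. A minimal sketch: the dictionary keys are illustrative labels chosen here, not the model's actual configuration schema.

```python
# Key figures from the model description, arranged as a plain dict.
# Field names are assumptions for illustration, not MiMo V2.5's config keys.
MIMO_V25_SPECS = {
    "total_params": 310_000_000_000,    # 310B total (Mixture-of-Experts)
    "active_params": 15_000_000_000,    # 15B active per token
    "num_layers": 48,
    "hidden_size": 4096,
    "attention": ("sliding_window", "global"),  # hybrid attention pattern
    "vision_encoder_params": 729_000_000,       # 729M vision encoder
    "max_context_tokens": 1_000_000,            # up to 1M-token context
}

# Fraction of parameters active per forward pass (MoE sparsity).
sparsity = MIMO_V25_SPECS["active_params"] / MIMO_V25_SPECS["total_params"]
```

This sparsity (roughly 4.8% of weights active per token) is what lets a 310B-parameter model serve with the compute profile of a much smaller dense model.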
Type: Omni Model
Capabilities: Text Generation, Instruction Following, Reasoning, Mathematical Reasoning, +7 more
Release Date: 28 April, 2026
License: MIT

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE
docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --shm-size=128g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN='YOUR_HF_TOKEN' \
  vllm/vllm-openai:mimov25-cu130 \
  XiaomiMiMo/MiMo-V2.5 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-model-len auto \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1024 \
  --enable-auto-tool-choice \
  --reasoning-parser mimo \
  --tool-call-parser mimo \
  --generation-config vllm \
  --trust-remote-code
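Once the container is serving, vLLM exposes an OpenAI-compatible API on port 8000. A minimal sketch of a chat completion request, assuming the server is running locally; the helper name and prompt are illustrative:

```python
import json
from urllib.request import Request, urlopen

def build_chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Build an OpenAI-compatible /v1/chat/completions request for the
    vLLM server started above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8000",
    "XiaomiMiMo/MiMo-V2.5",
    "Explain mixture-of-experts in one sentence.",
)
# Uncomment once the container is up:
# print(json.load(urlopen(req))["choices"][0]["message"]["content"])
```

The same endpoint also works with the official `openai` Python client by pointing its `base_url` at `http://localhost:8000/v1`.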
Note

For MiMo V2.5 support, use vllm/vllm-openai:mimov25-cu130 for CUDA 13, vllm/vllm-openai:mimov25-cu129 for CUDA 12.9, or any subsequent official vLLM release.

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.
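The latency metrics in these charts, time to first token (TTFT) and inter-token latency (ITL), can be derived from per-token arrival times recorded during a streaming request. A minimal sketch; the helper name and sample timestamps are illustrative:

```python
def latency_metrics(arrival_times):
    """Compute time-to-first-token (TTFT) and mean inter-token latency (ITL)
    from token arrival times, in seconds since the request was sent."""
    if not arrival_times:
        raise ValueError("need at least one token arrival time")
    ttft = arrival_times[0]
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl

# Example: first token arrives after 0.50 s, then one token every 0.02 s.
ttft, itl = latency_metrics([0.50, 0.52, 0.54, 0.56])  # ttft ≈ 0.50 s, itl ≈ 0.02 s
```

Under rising concurrency, TTFT typically grows as requests queue for prefill, while ITL reflects how decode throughput is shared across in-flight sequences.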

Benchmark charts: ITL vs Concurrency · Time to First Token · Throughput Scaling · Total Tokens/sec vs Avg TTFT

Deploy NVIDIA HGX B200 on Vultr Cloud GPU