
MiMo V2 Flash

MiMo-V2-Flash is a large-scale Mixture-of-Experts (MoE) language model developed by Xiaomi for high-speed reasoning, coding, and agentic workflows. The model has 309B total parameters with 15B activated, built on a 48-layer transformer architecture with 64 attention heads and a hidden size of 4,096. It uses a hybrid attention design that interleaves Sliding Window Attention and Global Attention in a 5:1 ratio, significantly reducing KV-cache memory while maintaining long-context performance. The model supports a context window of up to 256K tokens and integrates Multi-Token Prediction (MTP) to accelerate generation throughput.
Type: MoE LLM
Capabilities: Text Generation, Instruction Following, Reasoning, Mathematical Reasoning, and 5 more
Group Release Date: December 15, 2025
License: MIT
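
The 5:1 interleave described above can be pictured as a repeating six-layer block: five sliding-window layers followed by one global-attention layer. Below is a minimal Python sketch of that layout; the placement of the global layer within each block is an assumption for illustration, not Xiaomi's published layer map.

PYTHON
# Sketch: assign an attention kind to each of the 48 layers, assuming a
# repeating block of 5 sliding-window (SWA) layers + 1 global layer.
NUM_LAYERS = 48
BLOCK = 6  # 5 SWA + 1 global per block (assumed ordering)

def attention_kind(layer_idx: int) -> str:
    """Return 'global' for the last layer of each block, else 'swa'."""
    return "global" if layer_idx % BLOCK == BLOCK - 1 else "swa"

pattern = [attention_kind(i) for i in range(NUM_LAYERS)]
# 40 SWA layers and 8 global layers, i.e. the 5:1 ratio.
assert pattern.count("swa") == 40 and pattern.count("global") == 8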

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE
docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --shm-size=128g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN='YOUR_HF_TOKEN' \
  -e LD_LIBRARY_PATH='/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu' \
  vllm/vllm-openai:v0.15.0-cu130 \
  XiaomiMiMo/MiMo-V2-Flash \
  --tensor-parallel-size 4 \
  --max-model-len auto \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 65536 \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --max-num-seqs 1024 \
  --disable-log-requests \
  --generation-config vllm \
  --trust-remote-code
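
Once the container is serving, the OpenAI-compatible API is available on port 8000. A minimal sketch of a request using the official openai Python client follows; the prompt is illustrative, and the model name must match the Hugging Face ID passed to the container.

PYTHON
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2-Flash",
    messages=[{"role": "user", "content": "Summarize Mixture-of-Experts routing in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)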
Note

Because the model architecture has only 4 key-value heads, the maximum supported tensor-parallel size is 4.
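
As a quick sanity check, the KV heads must divide evenly across tensor-parallel ranks so that every GPU owns a whole number of heads. The helper below is a hypothetical illustration of that constraint, not part of vLLM.

PYTHON
NUM_KV_HEADS = 4  # from the MiMo-V2-Flash architecture

def kv_heads_per_gpu(tensor_parallel_size: int) -> int:
    """Each tensor-parallel rank must own a whole number of KV heads."""
    if NUM_KV_HEADS % tensor_parallel_size != 0:
        raise ValueError(
            f"TP size {tensor_parallel_size} does not divide {NUM_KV_HEADS} KV heads"
        )
    return NUM_KV_HEADS // tensor_parallel_size

for tp in (1, 2, 4, 8):
    try:
        print(f"TP={tp}: {kv_heads_per_gpu(tp)} KV head(s) per GPU")
    except ValueError as err:
        print(f"TP={tp}: unsupported ({err})")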

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.
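
A minimal sketch of that kind of concurrency sweep against the endpoint started above; the concurrency levels, prompt, and metric estimation (TTFT and ITL approximated from client-side streaming timestamps) are illustrative assumptions, not the exact benchmark harness.

PYTHON
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def one_request() -> tuple[float, float]:
    """Stream one completion; return (TTFT, mean ITL) in seconds."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="XiaomiMiMo/MiMo-V2-Flash",
        messages=[{"role": "user", "content": "Count from 1 to 50."}],
        max_tokens=128,
        stream=True,
    )
    arrivals = [time.perf_counter() for _ in stream]  # one timestamp per chunk
    ttft = arrivals[0] - start
    itls = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return ttft, sum(itls) / max(len(itls), 1)

for concurrency in (1, 8, 32, 128):  # fixed workload, rising concurrency
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: one_request(), range(concurrency)))
    avg_ttft = sum(r[0] for r in results) / len(results)
    avg_itl = sum(r[1] for r in results) / len(results)
    print(f"concurrency={concurrency}: avg TTFT={avg_ttft:.3f}s, avg ITL={avg_itl:.4f}s")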

Benchmark charts: ITL vs Concurrency, Time to First Token, Throughput Scaling, and Total Tokens/sec vs Avg TTFT.
