
MiMo V2 Flash

MiMo-V2-Flash is a large-scale Mixture-of-Experts (MoE) language model developed by Xiaomi for high-speed reasoning, coding, and agentic workflows. The model has 309B total parameters with 15B activated, built on a 48-layer transformer architecture with 64 attention heads and a hidden size of 4,096. It uses a hybrid attention design that interleaves Sliding Window Attention and Global Attention in a 5:1 ratio, significantly reducing KV-cache memory while maintaining long-context performance. The model supports a context window of up to 256K tokens and integrates Multi-Token Prediction (MTP) to accelerate generation throughput.
Type: MoE LLM
Capabilities: Text Generation, Instruction Following, Reasoning, Mathematical Reasoning, and 5 more
Group Release Date: December 15, 2025
License: MIT
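
The 5:1 interleave described above can be pictured as a repeating six-layer block: five sliding-window layers followed by one global-attention layer. Below is a minimal Python sketch of that layout; the placement of the global layer within each block is an assumption for illustration, not Xiaomi's published layer map.

PYTHON
# Sketch: assign an attention kind to each of the 48 layers, assuming a
# repeating block of 5 sliding-window (SWA) layers + 1 global layer.
NUM_LAYERS = 48
BLOCK = 6  # 5 SWA + 1 global per block (assumed ordering)

def attention_kind(layer_idx: int) -> str:
    """Return 'global' for the last layer of each block, else 'swa'."""
    return "global" if layer_idx % BLOCK == BLOCK - 1 else "swa"

pattern = [attention_kind(i) for i in range(NUM_LAYERS)]
# 40 SWA layers and 8 global layers, i.e. the 5:1 ratio.
assert pattern.count("swa") == 40 and pattern.count("global") == 8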

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE
docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --shm-size=128g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN='YOUR_HF_TOKEN' \
  -e LD_LIBRARY_PATH='/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu' \
  vllm/vllm-openai:v0.15.0-cu130 \
  XiaomiMiMo/MiMo-V2-Flash \
  --tensor-parallel-size 4 \
  --max-model-len auto \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 65536 \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --max-num-seqs 1024 \
  --disable-log-requests \
  --generation-config vllm \
  --trust-remote-code
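
Once the container is serving, the OpenAI-compatible API is available on port 8000. A minimal sketch of a request using the official openai Python client follows; the prompt is illustrative, and the model name must match the Hugging Face ID passed to the container.

PYTHON
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2-Flash",
    messages=[{"role": "user", "content": "Summarize Mixture-of-Experts routing in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)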
Note

Because the model architecture has only 4 key-value heads, the maximum supported tensor-parallel size is 4.
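
As a quick sanity check, the KV heads must divide evenly across tensor-parallel ranks so that every GPU owns a whole number of heads. The helper below is a hypothetical illustration of that constraint, not part of vLLM.

PYTHON
NUM_KV_HEADS = 4  # from the MiMo-V2-Flash architecture

def kv_heads_per_gpu(tensor_parallel_size: int) -> int:
    """Each tensor-parallel rank must own a whole number of KV heads."""
    if NUM_KV_HEADS % tensor_parallel_size != 0:
        raise ValueError(
            f"TP size {tensor_parallel_size} does not divide {NUM_KV_HEADS} KV heads"
        )
    return NUM_KV_HEADS // tensor_parallel_size

for tp in (1, 2, 4, 8):
    try:
        print(f"TP={tp}: {kv_heads_per_gpu(tp)} KV head(s) per GPU")
    except ValueError as err:
        print(f"TP={tp}: unsupported ({err})")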

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.
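
A minimal sketch of that kind of concurrency sweep against the endpoint started above; the concurrency levels, prompt, and metric estimation (TTFT and ITL approximated from client-side streaming timestamps) are illustrative assumptions, not the exact benchmark harness.

PYTHON
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def one_request() -> tuple[float, float]:
    """Stream one completion; return (TTFT, mean ITL) in seconds."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="XiaomiMiMo/MiMo-V2-Flash",
        messages=[{"role": "user", "content": "Count from 1 to 50."}],
        max_tokens=128,
        stream=True,
    )
    arrivals = [time.perf_counter() for _ in stream]  # one timestamp per chunk
    ttft = arrivals[0] - start
    itls = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return ttft, sum(itls) / max(len(itls), 1)

for concurrency in (1, 8, 32, 128):  # fixed workload, rising concurrency
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: one_request(), range(concurrency)))
    avg_ttft = sum(r[0] for r in results) / len(results)
    avg_itl = sum(r[1] for r in results) / len(results)
    print(f"concurrency={concurrency}: avg TTFT={avg_ttft:.3f}s, avg ITL={avg_itl:.4f}s")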

Benchmark charts: ITL vs Concurrency, Time to First Token, Throughput Scaling, and Total Tokens/sec vs Avg TTFT.
