MiniMax M2.5

Updated on 11 March, 2026

Deploy MiniMax's M2.5 on NVIDIA HGX B200 GPUs. This MoE model combines Lightning Attention with traditional SoftMax attention for efficient long-context inference.


Model Overview

| Property | Value |
|---|---|
| Model ID | MiniMaxAI/MiniMax-M2.5 |
| Architecture | MoE + Lightning Attention (linear + SoftMax hybrid) |
| Total Parameters | 229B |
| Active Parameters | 10B per token |
| Attention | Lightning Attention (linear O(n) intra-chunk) + SoftMax (inter-chunk) |
| Routing | MoE (256 experts, 8 active per token) |
| Context Length | 1,048,576 tokens (1M) |
| Quantization | Native FP8 support |
| License | MiniMax Model License |
| Link | HuggingFace |

Architecture

MiniMax M2.5 uses Lightning Attention, a hybrid approach that splits sequence processing into two components:

  • Intra-chunk (linear attention): Processes tokens within a chunk in O(n) time, no KV cache needed
  • Inter-chunk (SoftMax attention): Standard attention across chunk boundaries, maintains KV cache

Combined with MoE routing (only 10B of 229B parameters active per token), this creates an efficient model for long-context workloads. The 256-expert pool with 8 active per token provides high capacity with low per-token compute cost.
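To make the activation ratio concrete, a quick back-of-the-envelope check using only the figures quoted above:

```python
# Sanity-check MiniMax M2.5's MoE activation ratio, using only
# the parameter and expert counts quoted in this guide.
total_params_b = 229   # total parameters, billions
active_params_b = 10   # active parameters per token, billions
experts_total = 256
experts_active = 8

active_fraction = active_params_b / total_params_b   # ~4.4%
expert_fraction = experts_active / experts_total     # ~3.1%

print(f"active parameter fraction: {active_fraction:.1%}")
print(f"active expert fraction:    {expert_fraction:.1%}")
```

The active-parameter fraction (4.4%) comes out slightly above the active-expert fraction (3.1%), presumably because dense components such as attention layers run for every token regardless of routing.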

Implications for NVIDIA HGX B200 deployment:

  1. Moderate TP requirement: 10B active params fits well with TP=4 (4 GPUs)
  2. Hybrid KV cache: Linear attention layers have no KV cache; SoftMax layers do
  3. Good throughput scaling: MoE activates only 4.4% of parameters per token
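The hybrid-KV-cache implication can be sketched numerically. The layer split, head count, and head dimension below are placeholders (this guide does not list the real M2.5 values), so treat this purely as the shape of the calculation, not actual figures:

```python
def kv_cache_gb(softmax_layers, kv_heads, head_dim, seq_len,
                bytes_per_elem=1):  # 1 byte/element assumes an FP8 KV cache
    """Estimate KV-cache size for ONE sequence.

    Only the SoftMax-attention layers are counted: the linear
    (Lightning) attention layers keep no KV cache at all.
    The factor of 2 covers the K and V tensors.
    """
    elems = 2 * softmax_layers * kv_heads * head_dim * seq_len
    return elems * bytes_per_elem / 1024**3

# Placeholder architecture numbers -- NOT the real M2.5 config.
gb = kv_cache_gb(softmax_layers=8, kv_heads=8, head_dim=128, seq_len=32768)
print(f"~{gb:.2f} GB KV cache per 32k-token sequence")  # 0.50 GB here
```

The point of the sketch: because the linear-attention layers contribute zero to this sum, the hybrid design caches far fewer layers than a fully SoftMax model of the same depth would.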

Quick Start

```console
$ vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

Or with Docker:

```console
$ docker run --rm --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.12.0 \
  --model MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

Note
`--trust-remote-code` is required because MiniMax M2.5 uses custom model code. These examples assume vLLM 0.12.0; the model crashes on vLLM 0.16.0 (see Known Issues).

Configuration

| Flag | Purpose |
|---|---|
| `--tensor-parallel-size 4` | Distribute across 4 GPUs; model weights are ~115 GB in FP8 |
| `--max-model-len 32768` | Context window; the model supports up to 1M tokens |
| `--gpu-memory-utilization 0.90` | Reserve 90% of VRAM for model weights + KV cache |
| `--trust-remote-code` | Required for the custom architecture |

Memory Usage (NVIDIA HGX B200 Verified)

With TP=4 on FP8:

| Component | Per GPU | Total (4 GPUs) |
|---|---|---|
| Model weights | ~29 GB | ~115 GB |
| KV cache (available) | ~132 GB | ~528 GB |
| VRAM used | ~161 GB | ~644 GB |
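These numbers are consistent with a simple budget: `--gpu-memory-utilization 0.90` caps vLLM's claim per GPU, and whatever the weight shard doesn't use becomes KV-cache space. The 180 GB HBM figure below is an assumption inferred from the table (161 GB used ÷ 0.90 ≈ 179 GB):

```python
hbm_per_gpu_gb = 180           # assumed HBM per HGX B200 GPU (inferred)
util = 0.90                    # --gpu-memory-utilization
weights_total_gb = 115         # FP8 weights, from the table above
tp = 4                         # --tensor-parallel-size

budget = hbm_per_gpu_gb * util            # VRAM vLLM may use per GPU
weight_shard = weights_total_gb / tp      # weight shard per GPU under TP=4
kv_available = budget - weight_shard      # left over for KV cache

print(f"budget per GPU:   {budget:.0f} GB")        # ~162 GB
print(f"weight shard:     {weight_shard:.2f} GB")  # ~28.75 GB
print(f"KV cache per GPU: {kv_available:.2f} GB")  # ~133 GB
```

This lands within a couple of GB of the measured table values; the small gap is activation and framework overhead.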

Performance (NVIDIA HGX B200 Verified)

Benchmark parameters: 2048 input tokens, 512 output tokens, random dataset. TP=4 on 4x NVIDIA HGX B200.

Concurrency Scaling

| Concurrency | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 87 | 84 | 11.40 | 12.75 |
| 8 | 581 | 217 | 13.37 | 13.72 |
| 16 | 1,057 | 77 | 15.01 | 15.85 |
| 32 | 1,636 | 91 | 18.82 | 22.04 |
| 64 | 2,591 | 143 | 22.71 | 27.37 |
| 128 | 3,943 | 220 | 27.90 | 31.98 |
| 256 | 5,945 | 393 | 34.05 | 41.89 |
| 512 | 8,822 | 686 | 43.79 | 52.39 |
| 1024 | 8,838 | 687 | 43.73 | 52.25 |

Zero failed requests across all concurrency levels.

Peak Performance

| Metric | Value |
|---|---|
| Peak sustained throughput | 8,838 tok/s (output) |
| Peak burst throughput | 9,600 tok/s |
| Saturation point | ~512 concurrent requests |
| TTFT at 32 concurrent | 91 ms |
| TPOT at 32 concurrent | 18.82 ms |

Scaling Behavior

Throughput grows from 87 tok/s at concurrency 1 to ~8,800 tok/s at concurrency 512 (a ~101x increase), then plateaus: the model saturates at roughly 8,800 tok/s sustained output on 4 GPUs. Apart from an outlier at concurrency 8 (217 ms), TTFT stays under 100 ms through concurrency 32, and TPOT stays under 20 ms over the same range, making this configuration suitable for interactive applications at moderate load.
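The scaling curve can be summarized directly from the benchmark table above:

```python
# (concurrency, output tok/s) pairs from the benchmark table above
results = [(1, 87), (8, 581), (16, 1057), (32, 1636), (64, 2591),
           (128, 3943), (256, 5945), (512, 8822), (1024, 8838)]

base_c, base_t = results[0]
for c, t in results:
    speedup = t / base_t                  # throughput vs. concurrency 1
    efficiency = speedup / (c / base_c)   # 1.0 would be perfect linear scaling
    print(f"c={c:>4}  {t:>5} tok/s  speedup {speedup:6.1f}x  "
          f"efficiency {efficiency:.2f}")
```

Per-request efficiency falls as batches share the GPUs, but aggregate throughput keeps climbing until ~512 concurrent requests, after which the extra 512 requests at concurrency 1024 buy essentially nothing.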

Note
This model uses 4 of 8 available NVIDIA HGX B200 GPUs. In production, you could serve 2 independent instances on a single 8-GPU node, achieving ~17,600 tok/s aggregate throughput.
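One way to sketch that two-instance layout, pinning each instance to its own four GPUs via `CUDA_VISIBLE_DEVICES` (ports and GPU assignment here are illustrative, not from the benchmark setup):

```console
$ CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 4 --port 8000 \
    --max-model-len 32768 --gpu-memory-utilization 0.90 --trust-remote-code &
$ CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 4 --port 8001 \
    --max-model-len 32768 --gpu-memory-utilization 0.90 --trust-remote-code &
```

A load balancer in front of the two ports would then expose them as one endpoint.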

Test Endpoints

Chat Completion

```console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.5",
    "messages": [{"role": "user", "content": "Explain Lightning Attention"}],
    "max_tokens": 256
  }'
```
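The same chat request can be issued from Python against vLLM's OpenAI-compatible endpoint. The helper below just builds the URL and payload (the base URL assumes the default `vllm serve` port); sending it requires the server from Quick Start to be running:

```python
import json

def build_chat_request(prompt, model="MiniMaxAI/MiniMax-M2.5",
                       max_tokens=256, base_url="http://localhost:8000"):
    """Build (url, payload) for the OpenAI-compatible chat endpoint."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return url, payload

url, payload = build_chat_request("Explain Lightning Attention")
print(url)
print(json.dumps(payload, indent=2))

# To actually send it (server must be running):
#   import requests
#   r = requests.post(url, json=payload, timeout=120)
#   print(r.json()["choices"][0]["message"]["content"])
```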

Text Completion

```console
$ curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.5",
    "prompt": "The key difference between linear and softmax attention is",
    "max_tokens": 128
  }'
```

NVFP4 Variant

A community-quantized NVFP4 variant is available for NVIDIA HGX B200:

```console
$ vllm serve lukealonso/MiniMax-M2.5-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

NVFP4 could potentially halve the GPU count from TP=4 to TP=2. See FP8/NVFP4 Quantization for details.
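Rough weight-footprint arithmetic behind that claim, assuming 4-bit weights take half the bytes of FP8 and ignoring scale-factor overhead:

```python
fp8_weights_gb = 115                      # FP8 footprint from the memory table
nvfp4_weights_gb = fp8_weights_gb * 0.5   # 4-bit weights: half the bytes of FP8
tp = 2

print(f"NVFP4 weights: ~{nvfp4_weights_gb:.0f} GB total, "
      f"~{nvfp4_weights_gb / tp:.0f} GB per GPU at TP={tp}")
```

At roughly 29 GB of weights per GPU, two B200s leave ample room for KV cache, which is what makes the TP=2 configuration above plausible.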

Known Issues

  • vLLM 0.16.0 incompatible: MiniMax M2.5 crashes on vLLM 0.16.0 with `RuntimeError: n_group should not be zero` for DeepSeekV3 routing, because the fused MoE kernel assumes DeepSeek V3-style grouped routing. Use vLLM 0.12.0 for this model; all benchmarks in this guide were run on vLLM 0.12.0. See Troubleshooting for details.
  • MoE config warning: vLLM may log warnings about using a default MoE config for NVIDIA HGX B200. Performance is still strong; tuned kernel configs for Blackwell are expected in future vLLM releases.
  • Custom code: Always pass `--trust-remote-code`. The Lightning Attention implementation requires custom model code.
