Deploy MiniMax's M2.5 on NVIDIA HGX B200 GPUs. This MoE model combines Lightning Attention with traditional SoftMax attention for efficient long-context inference.
| Property | Value |
|---|---|
| Model ID | MiniMaxAI/MiniMax-M2.5 |
| Architecture | MoE + Lightning Attention (linear + SoftMax hybrid) |
| Total Parameters | 229B |
| Active Parameters | 10B per token |
| Attention | Lightning Attention (linear O(n) intra-chunk) + SoftMax (inter-chunk) |
| Routing | MoE (256 experts, 8 active per token) |
| Context Length | 1,048,576 tokens (1M) |
| Quantization | Native FP8 support |
| License | MiniMax Model License |
| Link | HuggingFace |
MiniMax M2.5 uses Lightning Attention, a hybrid approach that splits sequence processing into two components:

- Intra-chunk: linear attention, computed in O(n), keeps per-chunk cost low.
- Inter-chunk: conventional SoftMax attention handles interactions across chunks, preserving modeling quality.
Combined with MoE routing (only 10B of 229B parameters active per token), this creates an efficient model for long-context workloads. The 256-expert pool with 8 active per token provides high capacity with low per-token compute cost.
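To make the routing concrete, here is a minimal, generic sketch of top-k gating with an 8-of-256 configuration. The actual gating function M2.5 uses is not specified in this guide, so the softmax-then-top-k scheme below is an illustrative assumption, not the model's implementation.

```python
import math
import random

def route_token(gate_logits, top_k=8):
    """Pick the top_k experts for one token and renormalize their
    softmax gate weights so they sum to 1 (illustrative scheme only)."""
    # Softmax over all expert logits (numerically stabilized).
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k experts and renormalize their weights.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(256)]  # 256-expert pool
chosen = route_token(logits, top_k=8)              # 8 experts active per token
print(len(chosen), round(sum(w for _, w in chosen), 6))
```

Only the 8 selected experts run their FFNs for that token, which is why per-token compute tracks the 10B active parameters rather than the 229B total.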
To serve the model on an NVIDIA HGX B200 node with vLLM:
```bash
$ vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
```
Or with Docker:
```bash
$ docker run --rm --gpus all --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:v0.12.0 \
    --model MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
```
`--trust-remote-code` is required because MiniMax M2.5 uses custom model code.
| Flag | Purpose |
|---|---|
| `--tensor-parallel-size 4` | Distribute the model across 4 GPUs; weights are ~115 GB in FP8 |
| `--max-model-len 32768` | Context window for this deployment; the model supports up to 1M tokens |
| `--gpu-memory-utilization 0.90` | Reserve 90% of VRAM for model weights + KV cache |
| `--trust-remote-code` | Required for the custom model architecture |
Approximate VRAM usage with TP=4 and FP8 weights:
| Component | Per GPU | Total (4 GPUs) |
|---|---|---|
| Model weights | ~29 GB | ~115 GB |
| KV cache (available) | ~132 GB | ~528 GB |
| VRAM used | ~161 GB | ~644 GB |
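The per-GPU budget above follows from simple arithmetic. A quick sketch, assuming ~180 GB of HBM per B200 GPU (that figure is an assumption here, not stated elsewhere in this guide):

```python
# Back-of-envelope VRAM budget per GPU, using the numbers from the table above.
HBM_PER_GPU_GB = 180          # assumed HBM capacity of one B200 GPU
GPU_MEM_UTIL = 0.90           # --gpu-memory-utilization 0.90
WEIGHTS_PER_GPU_GB = 115 / 4  # ~115 GB of FP8 weights sharded over TP=4

usable = HBM_PER_GPU_GB * GPU_MEM_UTIL   # VRAM vLLM is allowed to use
kv_cache = usable - WEIGHTS_PER_GPU_GB   # the remainder goes to KV cache
print(f"usable: {usable:.0f} GB, weights: {WEIGHTS_PER_GPU_GB:.0f} GB, "
      f"kv cache: {kv_cache:.0f} GB")
```

This lands close to the ~132 GB of KV cache per GPU shown in the table; the small gap is activation and framework overhead.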
Benchmark parameters: 2,048 input tokens, 512 output tokens, random dataset. TP=4 across 4 B200 GPUs on an NVIDIA HGX B200 node.
| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 87 | 84 | 11.40 | 12.75 |
| 8 | 581 | 217 | 13.37 | 13.72 |
| 16 | 1,057 | 77 | 15.01 | 15.85 |
| 32 | 1,636 | 91 | 18.82 | 22.04 |
| 64 | 2,591 | 143 | 22.71 | 27.37 |
| 128 | 3,943 | 220 | 27.90 | 31.98 |
| 256 | 5,945 | 393 | 34.05 | 41.89 |
| 512 | 8,822 | 686 | 43.79 | 52.39 |
| 1024 | 8,838 | 687 | 43.73 | 52.25 |
Zero failed requests across all concurrency levels.
| Metric | Value |
|---|---|
| Peak sustained throughput | 8,838 tok/s (output) |
| Peak burst throughput | 9,600 tok/s |
| Saturation point | ~512 concurrent requests |
| TTFT at 32 concurrent | 91 ms |
| TPOT at 32 concurrent | 18.82 ms |
Output throughput rises ~101x (87 to 8,822 tok/s) as concurrency grows from 1 to 512, then plateaus: the system saturates at roughly 8,800 tok/s sustained output on 4 GPUs. TTFT stays under 100 ms and TPOT under 20 ms up to concurrency 32, making this configuration suitable for interactive applications at moderate load.
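The saturation behavior is easy to confirm from the table. A small check using the measured output-throughput values:

```python
# Output tok/s per concurrency level, copied from the benchmark table above.
tok_s = {1: 87, 8: 581, 16: 1057, 32: 1636, 64: 2591,
         128: 3943, 256: 5945, 512: 8822, 1024: 8838}

speedup = tok_s[512] / tok_s[1]     # ~101x gain from concurrency 1 -> 512
plateau = tok_s[1024] / tok_s[512]  # ~1.0: doubling concurrency adds nothing
print(f"speedup 1->512: {speedup:.1f}x, 512->1024: {plateau:.3f}x")
```

The near-1.0 ratio between concurrency 512 and 1024 is the saturation point: past it, extra concurrent requests only add queueing latency.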
```bash
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "MiniMaxAI/MiniMax-M2.5",
      "messages": [{"role": "user", "content": "Explain Lightning Attention"}],
      "max_tokens": 256
    }'
```
```bash
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "MiniMaxAI/MiniMax-M2.5",
      "prompt": "The key difference between linear and softmax attention is",
      "max_tokens": 128
    }'
```
A community-quantized NVFP4 variant is available for NVIDIA HGX B200:
```bash
$ vllm serve lukealonso/MiniMax-M2.5-NVFP4 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
```
NVFP4 roughly halves the weight footprint, which can halve the GPU count from TP=4 to TP=2. See FP8/NVFP4 Quantization for details.
`RuntimeError: n_group should not be zero` can occur because the fused MoE kernel assumes DeepSeek V3-style grouped routing. Use vLLM 0.12.0 for this model; all benchmarks in this guide were run on vLLM 0.12.0. See Troubleshooting for details. Also remember `--trust-remote-code`: the Lightning Attention implementation ships as custom model code.