Deploy MiniMax's M2.5 on NVIDIA HGX B200 GPUs. This MoE model combines Lightning Attention with traditional SoftMax attention for efficient long-context inference.
| Property | Value |
|---|---|
| Model ID | MiniMaxAI/MiniMax-M2.5 |
| Architecture | MoE + Lightning Attention (linear + SoftMax hybrid) |
| Total Parameters | 229B |
| Active Parameters | 10B per token |
| Attention | Lightning Attention (linear O(n) intra-chunk) + SoftMax (inter-chunk) |
| Routing | MoE (256 experts, 8 active per token) |
| Context Length | 1,048,576 tokens (1M) |
| Quantization | Native FP8 support |
| License | MiniMax Model License |
| Link | HuggingFace |
MiniMax M2.5 uses Lightning Attention, a hybrid approach that splits sequence processing into two components:

- Intra-chunk: linear attention, computed in O(n), keeps per-chunk cost low.
- Inter-chunk: conventional SoftMax attention handles interactions across chunks, preserving modeling quality.
Combined with MoE routing (only 10B of 229B parameters active per token), this creates an efficient model for long-context workloads. The 256-expert pool with 8 active per token provides high capacity with low per-token compute cost.
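To make the routing concrete, here is a minimal, generic sketch of top-k gating with an 8-of-256 configuration. The actual gating function M2.5 uses is not specified in this guide, so the softmax-then-top-k scheme below is an illustrative assumption, not the model's implementation.

```python
import math
import random

def route_token(gate_logits, top_k=8):
    """Pick the top_k experts for one token and renormalize their
    softmax gate weights so they sum to 1 (illustrative scheme only)."""
    # Softmax over all expert logits (numerically stabilized).
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k experts and renormalize their weights.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(256)]  # 256-expert pool
chosen = route_token(logits, top_k=8)              # 8 experts active per token
print(len(chosen), round(sum(w for _, w in chosen), 6))
```

Only the 8 selected experts run their FFNs for that token, which is why per-token compute tracks the 10B active parameters rather than the 229B total.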
To serve the model on an NVIDIA HGX B200 node with vLLM:
```bash
$ vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
```
Or with Docker:
```bash
$ docker run --rm --gpus all --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:v0.12.0 \
    --model MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
```
`--trust-remote-code` is required because MiniMax M2.5 uses custom model code.
| Flag | Purpose |
|---|---|
| `--tensor-parallel-size 4` | Distribute the model across 4 GPUs; weights are ~115 GB in FP8 |
| `--max-model-len 32768` | Context window for this deployment; the model supports up to 1M tokens |
| `--gpu-memory-utilization 0.90` | Reserve 90% of VRAM for model weights + KV cache |
| `--trust-remote-code` | Required for the custom model architecture |
Approximate VRAM usage with TP=4 and FP8 weights:
| Component | Per GPU | Total (4 GPUs) |
|---|---|---|
| Model weights | ~29 GB | ~115 GB |
| KV cache (available) | ~132 GB | ~528 GB |
| VRAM used | ~161 GB | ~644 GB |
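The per-GPU budget above follows from simple arithmetic. A quick sketch, assuming ~180 GB of HBM per B200 GPU (that figure is an assumption here, not stated elsewhere in this guide):

```python
# Back-of-envelope VRAM budget per GPU, using the numbers from the table above.
HBM_PER_GPU_GB = 180          # assumed HBM capacity of one B200 GPU
GPU_MEM_UTIL = 0.90           # --gpu-memory-utilization 0.90
WEIGHTS_PER_GPU_GB = 115 / 4  # ~115 GB of FP8 weights sharded over TP=4

usable = HBM_PER_GPU_GB * GPU_MEM_UTIL   # VRAM vLLM is allowed to use
kv_cache = usable - WEIGHTS_PER_GPU_GB   # the remainder goes to KV cache
print(f"usable: {usable:.0f} GB, weights: {WEIGHTS_PER_GPU_GB:.0f} GB, "
      f"kv cache: {kv_cache:.0f} GB")
```

This lands close to the ~132 GB of KV cache per GPU shown in the table; the small gap is activation and framework overhead.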
Benchmark parameters: 2,048 input tokens, 512 output tokens, random dataset. TP=4 across 4 B200 GPUs on an NVIDIA HGX B200 node.
| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 87 | 84 | 11.40 | 12.75 |
| 8 | 581 | 217 | 13.37 | 13.72 |
| 16 | 1,057 | 77 | 15.01 | 15.85 |
| 32 | 1,636 | 91 | 18.82 | 22.04 |
| 64 | 2,591 | 143 | 22.71 | 27.37 |
| 128 | 3,943 | 220 | 27.90 | 31.98 |
| 256 | 5,945 | 393 | 34.05 | 41.89 |
| 512 | 8,822 | 686 | 43.79 | 52.39 |
| 1024 | 8,838 | 687 | 43.73 | 52.25 |
Zero failed requests across all concurrency levels.
| Metric | Value |
|---|---|
| Peak sustained throughput | 8,838 tok/s (output) |
| Peak burst throughput | 9,600 tok/s |
| Saturation point | ~512 concurrent requests |
| TTFT at 32 concurrent | 91 ms |
| TPOT at 32 concurrent | 18.82 ms |
Output throughput rises ~101x (87 to 8,822 tok/s) as concurrency grows from 1 to 512, then plateaus: the system saturates at roughly 8,800 tok/s sustained output on 4 GPUs. TTFT stays under 100 ms and TPOT under 20 ms up to concurrency 32, making this configuration suitable for interactive applications at moderate load.
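The saturation behavior is easy to confirm from the table. A small check using the measured output-throughput values:

```python
# Output tok/s per concurrency level, copied from the benchmark table above.
tok_s = {1: 87, 8: 581, 16: 1057, 32: 1636, 64: 2591,
         128: 3943, 256: 5945, 512: 8822, 1024: 8838}

speedup = tok_s[512] / tok_s[1]     # ~101x gain from concurrency 1 -> 512
plateau = tok_s[1024] / tok_s[512]  # ~1.0: doubling concurrency adds nothing
print(f"speedup 1->512: {speedup:.1f}x, 512->1024: {plateau:.3f}x")
```

The near-1.0 ratio between concurrency 512 and 1024 is the saturation point: past it, extra concurrent requests only add queueing latency.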
```bash
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "MiniMaxAI/MiniMax-M2.5",
      "messages": [{"role": "user", "content": "Explain Lightning Attention"}],
      "max_tokens": 256
    }'
```
```bash
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "MiniMaxAI/MiniMax-M2.5",
      "prompt": "The key difference between linear and softmax attention is",
      "max_tokens": 128
    }'
```
A community-quantized NVFP4 variant is available for NVIDIA HGX B200:
```bash
$ vllm serve lukealonso/MiniMax-M2.5-NVFP4 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
```
NVFP4 roughly halves the weight footprint, which can halve the GPU count from TP=4 to TP=2. See FP8/NVFP4 Quantization for details.
`RuntimeError: n_group should not be zero` can occur because the fused MoE kernel assumes DeepSeek V3-style grouped routing. Use vLLM 0.12.0 for this model; all benchmarks in this guide were run on vLLM 0.12.0. See Troubleshooting for details. Also remember `--trust-remote-code`: the Lightning Attention implementation ships as custom model code.