Mistral Large 3 675B Instruct 2512

Mistral-Large-3-675B-Instruct-2512 is a large-scale multimodal Mixture-of-Experts (MoE) model designed for instruction following, agentic workflows, and long-context understanding. It has 675B total parameters, of which roughly 41B are active per token, using a 61-layer transformer with 128 attention heads and a hidden size of 7,168. The architecture comprises 128 experts with 4 selected per token (see the routing sketch below) and supports a context window of roughly 294K tokens. It also integrates a 2.5B-parameter vision encoder and ships with FP8 quantization, enabling efficient deployment and strong performance across text and vision tasks.
Type: Vision-Language Model
Capabilities: Text Generation, Instruction Following, Reasoning, Mathematical Reasoning, and 6 more
Release Date: December 1, 2025
License: Apache 2.0
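
To make the expert-routing figures concrete, here is a minimal PyTorch sketch of top-4-of-128 token routing. It mirrors the layer dimensions quoted above, but it is a generic illustration of MoE gating, not Mistral's actual router; the linear-router design and all names are assumptions.

PYTHON
import torch

HIDDEN = 7168     # hidden size quoted above
N_EXPERTS = 128   # total experts per MoE layer
TOP_K = 4         # experts activated per token

# Illustrative router: a linear map from hidden states to per-expert scores.
router = torch.nn.Linear(HIDDEN, N_EXPERTS, bias=False)

def route(tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Return the top-4 expert indices per token and their mixing weights."""
    scores = router(tokens)                               # [n_tokens, 128]
    weights, experts = torch.topk(scores, TOP_K, dim=-1)  # [n_tokens, 4] each
    weights = torch.softmax(weights, dim=-1)              # normalize over the chosen 4
    return experts, weights

x = torch.randn(3, HIDDEN)           # three example token states
experts, weights = route(x)
print(experts.shape, weights.shape)  # torch.Size([3, 4]) torch.Size([3, 4])

Only the selected 4 experts run for each token, which is why only about 41B of the 675B parameters are active on any forward pass.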

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs with vLLM using the command below. Copy the command to get started with inference.

CONSOLE
# Serve Mistral-Large-3-675B-Instruct-2512 via vLLM's OpenAI-compatible API on port 8000.
docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --shm-size=128g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN='YOUR_HF_TOKEN' \
  vllm/vllm-openai:v0.18.0 \
  mistralai/Mistral-Large-3-675B-Instruct-2512 \
  --tensor-parallel-size 8 \
  --max-model-len auto \
  --reasoning-parser mistral \
  --tool-call-parser mistral \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 1024 \
  --trust-remote-code
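
Once the container is up, vLLM serves the model through an OpenAI-compatible API on port 8000. The following minimal sketch sends a chat request with the official openai Python client; the prompt and token budget are illustrative.

PYTHON
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is not checked for local serving.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-675B-Instruct-2512",
    messages=[{"role": "user", "content": "Summarize FP8 inference in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Because the server was launched with --enable-auto-tool-choice and --tool-call-parser mistral, the same endpoint also accepts the standard tools and tool_choice parameters for agentic workflows.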

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.

Benchmark charts: ITL (inter-token latency) vs Concurrency; Time to First Token; Throughput Scaling; Total Tokens/sec vs Avg TTFT.
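
These metrics can be reproduced against the deployment above with a small async client. Below is a minimal sketch that measures average TTFT (time to first token) and ITL at a few concurrency levels, assuming the vLLM endpoint from the deployment step is reachable on localhost:8000; the prompt, token budget, and concurrency levels are illustrative.

PYTHON
import asyncio
import time

from openai import AsyncOpenAI

MODEL = "mistralai/Mistral-Large-3-675B-Instruct-2512"
PROMPT = "Explain KV caching in one paragraph."

async def one_request(client: AsyncOpenAI) -> tuple[float, float]:
    """Stream one completion; return (TTFT, mean ITL) in seconds."""
    start = time.perf_counter()
    arrivals: list[float] = []
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=128,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            arrivals.append(time.perf_counter())  # one timestamp per streamed chunk
    if not arrivals:
        return 0.0, 0.0
    ttft = arrivals[0] - start
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return ttft, sum(gaps) / len(gaps) if gaps else 0.0

async def benchmark(concurrency: int) -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    results = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
    avg_ttft = sum(r[0] for r in results) / len(results)
    avg_itl = sum(r[1] for r in results) / len(results)
    print(f"concurrency={concurrency}  avg TTFT={avg_ttft:.3f}s  avg ITL={avg_itl * 1000:.1f}ms")

if __name__ == "__main__":
    for c in (1, 8, 32):
        asyncio.run(benchmark(c))

Throughput (total tokens/sec) follows from the same run by dividing the number of generated tokens by wall-clock time.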
