DeepSeek V4 Pro icon

DeepSeek V4 Pro

NVIDIA
DeepSeek V4 Pro is an ultra-large Mixture-of-Experts model designed for high-performance long-context reasoning and large-scale deployment. It features 1.6T total parameters with approximately 49B activated, using 384 routed experts with 6 selected per token across a 61-layer architecture with 7,168 hidden size and 128 attention heads. Built with hybrid compressed attention mechanisms and manifold-constrained hyper-connections, it supports up to a 1M token context window. With FP4 and FP8 mixed precision and advanced optimization techniques, it delivers strong efficiency, stability, and agentic reasoning performance across complex workloads.
TypeMoE LLM
CapabilitiesText Generation, Instruction Following, Reasoning, Mathematical Reasoning+5 more
Release Date24 April, 2026
Links
LicenseMIT

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE
docker run -it --rm 
 --runtime=nvidia 
 --gpus all 
 --ipc=host 
 --shm-size=128g 
 -p 8000:8000 
 -v ~/.cache/huggingface:/root/.cache/huggingface 
 -e HF_TOKEN='YOUR_HF_TOKEN' 
 vllm/vllm-openai:v0.20.0 
 deepseek-ai/DeepSeek-V4-Pro 
 --attention_config.use_fp4_indexer_cache=True 
 --kv-cache-dtype fp8 
 --block-size 256 
 --tensor-parallel-size 8 
 --enable-expert-parallel 
 --max-model-len auto 
  --max-num-batched-tokens 65536 
 --gpu-memory-utilization 0.90 
 --tool-call-parser deepseek_v4 
 --reasoning-parser deepseek_v4 
 --tokenizer-mode deepseek_v4 
 --enable-auto-tool-choice 
 --max-num-seqs 1024 
 --trust-remote-code

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.

ITL vs Concurrency

Time to First Token

Throughput Scaling

Total Tokens/sec vs Avg TTFT

Vultr Cloud GPU

NVIDIA HGX B200

Deploy NVIDIA B200 on Vultr Cloud GPU