DeepSeek V4 Flash icon

DeepSeek V4 Flash

NVIDIA
DeepSeek V4 Flash is a large-scale Mixture-of-Experts model optimized for ultra-long context reasoning and efficient inference. It features 284B total parameters with approximately 13B activated, using 256 routed experts with 6 selected per token across a 43-layer architecture with 4,096 hidden size and 64 attention heads. Built with hybrid CSA + HCA attention and manifold-constrained hyper-connections, it supports up to a 1M token context window. With FP4 and FP8 mixed precision and strong agentic tool-calling capabilities, it is designed for scalable, high-efficiency reasoning and multi-domain workloads.
TypeMoE LLM
CapabilitiesText Generation, Instruction Following, Reasoning, Mathematical Reasoning+5 more
Release Date24 April, 2026
Links
LicenseMIT

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE
docker run -it --rm 
 --runtime=nvidia 
 --gpus all 
 --ipc=host 
 --shm-size=128g 
 -p 8000:8000 
 -v ~/.cache/huggingface:/root/.cache/huggingface 
 -e HF_TOKEN='YOUR_HF_TOKEN' 
 vllm/vllm-openai:v0.20.0 
 deepseek-ai/DeepSeek-V4-Flash 
 --attention_config.use_fp4_indexer_cache=True 
 --kv-cache-dtype fp8 
 --block-size 256 
 --tensor-parallel-size 4 
 --enable-expert-parallel 
 --max-model-len auto 
  --max-num-batched-tokens 65536 
 --compilation-config '{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\", \"custom_ops\":[\"all\"]}' 
 --gpu-memory-utilization 0.90 
 --tool-call-parser deepseek_v4 
 --reasoning-parser deepseek_v4 
 --tokenizer-mode deepseek_v4 
 --enable-auto-tool-choice 
 --max-num-seqs 1024 
 --trust-remote-code

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.

ITL vs Concurrency

Time to First Token

Throughput Scaling

Total Tokens/sec vs Avg TTFT

Vultr Cloud GPU

NVIDIA HGX B200

Deploy NVIDIA B200 on Vultr Cloud GPU