Latest ContentInference Cookbook Model Library

DeepSeek V4 Flash

DeepSeek V4 Flash is a large-scale Mixture-of-Experts model optimized for ultra-long context reasoning and efficient inference. It features 284B total parameters with approximately 13B activated, using 256 routed experts with 6 selected per token across a 43-layer architecture with 4,096 hidden size and 64 attention heads. Built with hybrid CSA + HCA attention and manifold-constrained hyper-connections, it supports up to a 1M token context window. With FP4 and FP8 mixed precision and strong agentic tool-calling capabilities, it is designed for scalable, high-efficiency reasoning and multi-domain workloads.

Type	MoE LLM
Capabilities	Text Generation, Instruction Following, Reasoning, Mathematical Reasoning+5 more
Release Date	24 April, 2026
Links	Blog\|HF Model Card
License	MIT

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE

docker run -it --rm 
 --runtime=nvidia 
 --gpus all 
 --ipc=host 
 --shm-size=128g 
 -p 8000:8000 
 -v ~/.cache/huggingface:/root/.cache/huggingface 
 -e HF_TOKEN='YOUR_HF_TOKEN' 
 vllm/vllm-openai:v0.20.0 
 deepseek-ai/DeepSeek-V4-Flash 
 --attention_config.use_fp4_indexer_cache=True 
 --kv-cache-dtype fp8 
 --block-size 256 
 --tensor-parallel-size 4 
 --enable-expert-parallel 
 --max-model-len auto 
  --max-num-batched-tokens 65536 
 --compilation-config '{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\", \"custom_ops\":[\"all\"]}' 
 --gpu-memory-utilization 0.90 
 --tool-call-parser deepseek_v4 
 --reasoning-parser deepseek_v4 
 --tokenizer-mode deepseek_v4 
 --enable-auto-tool-choice 
 --max-num-seqs 1024 
 --trust-remote-code

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.

DeepSeek V4 Flash

Inference Instructions

Model Benchmarks

ITL vs Concurrency

Time to First Token

Throughput Scaling

Total Tokens/sec vs Avg TTFT

NVIDIA HGX B200

Products

Features

Solutions

Marketplace

Resources

Company

Tech Talks

Vultr Blogs