DeepSeek V3.2

Updated on 11 March, 2026

Deploy DeepSeek V3.2 on NVIDIA HGX B200 GPUs. This MoE model uses Multi-head Latent Attention (MLA) for compressed KV caching, delivering strong reasoning performance at 685B total parameters.


Model Overview

| Property | Value |
|---|---|
| Model ID | deepseek-ai/DeepSeek-V3-0324 |
| Architecture | MoE + Multi-head Latent Attention (MLA) |
| Total Parameters | 685B |
| Active Parameters | ~37B per token |
| Attention | Multi-head Latent Attention (compressed KV) |
| Routing | MoE (256 experts, 8 active per token) |
| Context Length | 128K tokens |
| Quantization | On-the-fly FP8 via --quantization fp8 |
| License | MIT |
| Link | HuggingFace |

Architecture

DeepSeek V3.2 uses Multi-head Latent Attention (MLA), which compresses the KV projections into a lower-dimensional latent space before caching:

  • Standard GQA: Stores full key and value tensors per head per token
  • MLA: Projects KV into a shared latent vector, then reconstructs on the fly

This compression means:

  • Much less KV cache per token than a standard 685B model
  • Higher memory efficiency at long context lengths
  • Standard KV cache offloading is not supported (different cache format)
  • --block-size 1 is required

Combined with MoE routing (~37B active of 685B total), DeepSeek V3.2 is memory-efficient for its size.
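The down-project / cache / up-project flow described above can be sketched in a few lines. This is a toy illustration with made-up dimensions, not DeepSeek V3.2's real projection matrices:

```python
import random

# Toy sketch of MLA's KV compression: project the hidden state down to a
# shared latent, cache only the latent, reconstruct K and V on the fly.
# All dimensions here are illustrative, not the real model's.
d_model, n_heads, head_dim, latent_dim = 64, 4, 16, 8
random.seed(0)

def rand_mat(rows, cols):
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

def matmul_vec(vec, mat):  # (1 x n) @ (n x m) -> (1 x m)
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(len(mat[0]))]

W_down = rand_mat(d_model, latent_dim)             # shared KV down-projection
W_up_k = rand_mat(latent_dim, n_heads * head_dim)  # up-projection to keys
W_up_v = rand_mat(latent_dim, n_heads * head_dim)  # up-projection to values

h = [random.gauss(0, 1) for _ in range(d_model)]   # one token's hidden state
c_kv = matmul_vec(h, W_down)    # <-- only this latent goes into the KV cache
k = matmul_vec(c_kv, W_up_k)    # keys reconstructed at attention time
v = matmul_vec(c_kv, W_up_v)    # values reconstructed at attention time

# Per-token cache footprint: GQA stores K and V in full, MLA stores only c_kv
gqa_floats = 2 * n_heads * head_dim  # 128
mla_floats = len(c_kv)               # 8
print(f"GQA caches {gqa_floats} floats/token, MLA caches {mla_floats}")
```

In the real model the up-projections fold into the attention computation, so the reconstruction adds little overhead while the cache shrinks by the ratio of full KV width to latent width.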

Quick Start

```console
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --quantization fp8 \
  --block-size 1
```

Or with Docker:

```console
$ docker run --rm --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.16.0 \
  --model deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --quantization fp8 \
  --block-size 1
```

Important
--block-size 1 is required for MLA models. --quantization fp8 applies on-the-fly FP8 quantization, since the base model checkpoint is BF16.

Configuration

| Flag | Purpose |
|---|---|
| --tensor-parallel-size 8 | Full 8-GPU deployment for the 685B model |
| --max-model-len 32768 | Context window; the model supports up to 128K |
| --gpu-memory-utilization 0.90 | Allow vLLM to use up to 90% of VRAM |
| --trust-remote-code | Required for the MLA implementation |
| --quantization fp8 | On-the-fly FP8 quantization (no pre-quantized checkpoint) |
| --block-size 1 | Required for the MLA KV cache format |
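When picking --max-model-len, it helps to relate context length to the KV cache capacity vLLM reports for this deployment (1,149,440 tokens, see the memory section). Illustrative arithmetic, not a vLLM API:

```python
# Upper bound on how many requests can sit at full context simultaneously,
# given the KV cache capacity vLLM reported for this deployment.
kv_capacity_tokens = 1_149_440

for max_len in (8_192, 32_768, 131_072):
    print(max_len, kv_capacity_tokens // max_len)
```

At --max-model-len 32768 roughly 35 full-context sequences fit at once; in practice most requests use far less than the maximum, so effective concurrency is much higher.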

Memory Usage (NVIDIA HGX B200 Verified)

With TP=8 on FP8 (on-the-fly quantization):

| Component | Per GPU | Total (8 GPUs) |
|---|---|---|
| Model weights | ~86 GB | ~685 GB |
| KV cache (available) | ~75 GB | ~602 GB |
| VRAM used | ~161 GB | ~1,288 GB |

Note
vLLM reported 75.23 GiB available for KV cache, totalling 1,149,440 tokens of cache capacity. MLA's compressed KV format means each token uses significantly less cache than standard GQA.
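These figures are consistent with DeepSeek-V3's published MLA dimensions (61 layers, a 512-dim KV latent plus a 64-dim decoupled RoPE key, cached in BF16), assuming the latent cache is replicated across TP ranks so the usable budget is the per-GPU 75.23 GiB. A back-of-envelope sanity check, not vLLM's exact accounting:

```python
# Back-of-envelope check of the memory numbers in the table above.
# MLA dimensions below are from DeepSeek-V3's published model config.
num_layers = 61
kv_lora_rank = 512       # compressed KV latent dimension
qk_rope_head_dim = 64    # decoupled RoPE key, cached alongside the latent
bytes_per_elem = 2       # latent cache held in BF16

per_token = num_layers * (kv_lora_rank + qk_rope_head_dim) * bytes_per_elem
print(per_token)         # 70272 bytes (~68.6 KiB) per cached token

# Assuming the MLA latent cache is replicated across TP ranks, the usable
# budget is the per-GPU 75.23 GiB:
tokens = 75.23 * 2**30 / per_token
print(round(tokens))     # ~1.15 million tokens, in line with vLLM's 1,149,440

# FP8 weights: ~1 byte per parameter, sharded across 8 GPUs
print(round(685e9 / 8 / 1e9))  # ~86 GB per GPU, matching the table
```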

Performance (NVIDIA HGX B200 Verified)

Benchmark parameters: 2048 input tokens, 512 output tokens, random dataset, TP=8 on an NVIDIA HGX B200 (8x B200 GPUs).

Concurrency Scaling

| Concurrency | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 106 | 92 | 9.28 | 9.46 |
| 8 | 503 | 1,311 | 13.36 | 12.90 |
| 16 | 930 | 618 | 16.00 | 15.47 |
| 32 | 1,382 | 931 | 21.34 | 22.32 |
| 64 | 2,281 | 843 | 26.39 | 143.82 |
| 128 | 3,545 | 1,263 | 33.52 | 222.37 |
| 256 | 4,010 | 2,607 | 58.45 | 232.63 |
| 512 | 4,321 | 7,397 | 98.86 | 241.35 |
| 1024 | 4,370 | 58,298 | 104.21 | 243.37 |

Peak Performance

| Metric | Value |
|---|---|
| Peak sustained throughput | 4,370 tok/s (c=1024) |
| Peak burst throughput | 9,216 tok/s |
| Saturation point | ~512 concurrent |

Key Observations

  • Strong scaling to c=512 — DeepSeek V3.2 scales nearly linearly up to 256 concurrent requests and continues gaining throughput at 512. The 4,370 tok/s at c=1024 is only marginally better than c=512 (4,321 tok/s), indicating saturation around 512.
  • MLA efficiency — MLA's compressed KV cache provides 1,149,440 tokens of cache capacity (75.23 GiB per GPU), enabling high concurrency without running out of KV cache.
  • Low single-request latency — TPOT of 9.28ms at c=1 is among the lowest of all models tested, demonstrating MLA's efficient decode phase.
  • TTFT remains manageable — Even at c=256, TTFT stays under 3 seconds, making DeepSeek V3.2 suitable for interactive workloads at moderate concurrency.
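The saturation behaviour can also be read off the concurrency table by dividing aggregate throughput by concurrency to get the per-stream decode rate (figures copied from the table above):

```python
# Per-stream decode rate: total output tok/s divided by concurrency,
# using throughput figures from the benchmark table above.
throughput = {1: 106, 32: 1382, 256: 4010, 512: 4321, 1024: 4370}
per_stream = {c: round(t / c, 1) for c, t in throughput.items()}
print(per_stream)
# Each stream slows from ~106 tok/s at c=1 to ~4 tok/s at c=1024, while
# doubling concurrency past 512 adds almost no aggregate throughput.
```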

Test Endpoints

Chat Completion

```console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3-0324",
    "messages": [{"role": "user", "content": "Explain Multi-Latent Attention"}],
    "max_tokens": 256
  }'
```

Text Completion

```console
$ curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3-0324",
    "prompt": "The advantage of MLA over standard GQA is",
    "max_tokens": 128
  }'
```

Known Issues

  • vLLM 0.16.0+ recommended: While DeepSeek V3 is supported in earlier versions, vLLM 0.16.0 adds a FlashInfer MLA backend and TRT-LLM ragged prefill optimizations that significantly improve performance.
  • On-the-fly quantization: Unlike the other models in this series, which use pre-quantized checkpoints, DeepSeek V3.2 is quantized at load time via --quantization fp8. This lengthens startup but avoids the need for a separate FP8 checkpoint.
  • block-size 1 required: MLA's compressed KV format is incompatible with the default block size; if you omit --block-size 1, vLLM fails at startup with an explicit error.
  • No KV cache offloading: MLA's custom cache format doesn't support standard vLLM KV cache offloading. If you need to extend the cache beyond GPU VRAM, use Dynamo's tiered KV caching instead.
  • Large download: The base model is ~642 GB of BF16 weights, quantized to FP8 at load time.
