DeepSeek V3.2

Updated on 11 March, 2026

Deploy DeepSeek V3.2 on NVIDIA HGX B200 GPUs. This MoE model uses Multi-head Latent Attention (MLA) for compressed KV caching, delivering strong reasoning performance at 685B total parameters.


Model Overview

| Property | Value |
|---|---|
| Model ID | deepseek-ai/DeepSeek-V3-0324 |
| Architecture | MoE + Multi-head Latent Attention (MLA) |
| Total Parameters | 685B |
| Active Parameters | ~37B per token |
| Attention | Multi-head Latent Attention (compressed KV) |
| Routing | MoE (256 experts, 8 active per token) |
| Context Length | 128K tokens |
| Quantization | On-the-fly FP8 via --quantization fp8 |
| License | MIT |
| Link | HuggingFace |

Architecture

DeepSeek V3.2 uses Multi-head Latent Attention (MLA), which compresses the KV projections into a lower-dimensional latent space before caching:

  • Standard GQA: Stores full key and value tensors per head per token
  • MLA: Projects KV into a shared latent vector, then reconstructs on the fly

This compression means:

  • Much less KV cache per token than a standard 685B model
  • Higher memory efficiency at long context lengths
  • Standard KV cache offloading is not supported (different cache format)
  • --block-size 1 is required

Combined with MoE routing (~37B active of 685B total), DeepSeek V3.2 is memory-efficient for its size.
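The down-project / cache / up-project flow described above can be sketched in a few lines. This is a toy illustration with made-up dimensions, not DeepSeek V3.2's real projection matrices:

```python
import random

# Toy sketch of MLA's KV compression: project the hidden state down to a
# shared latent, cache only the latent, reconstruct K and V on the fly.
# All dimensions here are illustrative, not the real model's.
d_model, n_heads, head_dim, latent_dim = 64, 4, 16, 8
random.seed(0)

def rand_mat(rows, cols):
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

def matmul_vec(vec, mat):  # (1 x n) @ (n x m) -> (1 x m)
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(len(mat[0]))]

W_down = rand_mat(d_model, latent_dim)             # shared KV down-projection
W_up_k = rand_mat(latent_dim, n_heads * head_dim)  # up-projection to keys
W_up_v = rand_mat(latent_dim, n_heads * head_dim)  # up-projection to values

h = [random.gauss(0, 1) for _ in range(d_model)]   # one token's hidden state
c_kv = matmul_vec(h, W_down)    # <-- only this latent goes into the KV cache
k = matmul_vec(c_kv, W_up_k)    # keys reconstructed at attention time
v = matmul_vec(c_kv, W_up_v)    # values reconstructed at attention time

# Per-token cache footprint: GQA stores K and V in full, MLA stores only c_kv
gqa_floats = 2 * n_heads * head_dim  # 128
mla_floats = len(c_kv)               # 8
print(f"GQA caches {gqa_floats} floats/token, MLA caches {mla_floats}")
```

In the real model the up-projections fold into the attention computation, so the reconstruction adds little overhead while the cache shrinks by the ratio of full KV width to latent width.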

Quick Start

```console
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --quantization fp8 \
  --block-size 1
```

Or with Docker:

```console
$ docker run --rm --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.16.0 \
  --model deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --quantization fp8 \
  --block-size 1
```

Important
--block-size 1 is required for MLA models. --quantization fp8 applies on-the-fly FP8 quantization, since the base model checkpoint is BF16.

Configuration

| Flag | Purpose |
|---|---|
| --tensor-parallel-size 8 | Full 8-GPU deployment for the 685B model |
| --max-model-len 32768 | Context window; the model supports up to 128K |
| --gpu-memory-utilization 0.90 | Allow vLLM to use up to 90% of VRAM |
| --trust-remote-code | Required for the MLA implementation |
| --quantization fp8 | On-the-fly FP8 quantization (no pre-quantized checkpoint) |
| --block-size 1 | Required for the MLA KV cache format |
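When picking --max-model-len, it helps to relate context length to the KV cache capacity vLLM reports for this deployment (1,149,440 tokens, see the memory section). Illustrative arithmetic, not a vLLM API:

```python
# Upper bound on how many requests can sit at full context simultaneously,
# given the KV cache capacity vLLM reported for this deployment.
kv_capacity_tokens = 1_149_440

for max_len in (8_192, 32_768, 131_072):
    print(max_len, kv_capacity_tokens // max_len)
```

At --max-model-len 32768 roughly 35 full-context sequences fit at once; in practice most requests use far less than the maximum, so effective concurrency is much higher.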

Memory Usage (NVIDIA HGX B200 Verified)

With TP=8 on FP8 (on-the-fly quantization):

| Component | Per GPU | Total (8 GPUs) |
|---|---|---|
| Model weights | ~86 GB | ~685 GB |
| KV cache (available) | ~75 GB | ~602 GB |
| VRAM used | ~161 GB | ~1,288 GB |

Note
vLLM reported 75.23 GiB available for KV cache, totalling 1,149,440 tokens of cache capacity. MLA's compressed KV format means each token uses significantly less cache than standard GQA.
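These figures are consistent with DeepSeek-V3's published MLA dimensions (61 layers, a 512-dim KV latent plus a 64-dim decoupled RoPE key, cached in BF16), assuming the latent cache is replicated across TP ranks so the usable budget is the per-GPU 75.23 GiB. A back-of-envelope sanity check, not vLLM's exact accounting:

```python
# Back-of-envelope check of the memory numbers in the table above.
# MLA dimensions below are from DeepSeek-V3's published model config.
num_layers = 61
kv_lora_rank = 512       # compressed KV latent dimension
qk_rope_head_dim = 64    # decoupled RoPE key, cached alongside the latent
bytes_per_elem = 2       # latent cache held in BF16

per_token = num_layers * (kv_lora_rank + qk_rope_head_dim) * bytes_per_elem
print(per_token)         # 70272 bytes (~68.6 KiB) per cached token

# Assuming the MLA latent cache is replicated across TP ranks, the usable
# budget is the per-GPU 75.23 GiB:
tokens = 75.23 * 2**30 / per_token
print(round(tokens))     # ~1.15 million tokens, in line with vLLM's 1,149,440

# FP8 weights: ~1 byte per parameter, sharded across 8 GPUs
print(round(685e9 / 8 / 1e9))  # ~86 GB per GPU, matching the table
```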

Performance (NVIDIA HGX B200 Verified)

Benchmark parameters: 2048 input tokens, 512 output tokens, random dataset, TP=8 on an NVIDIA HGX B200 (8x B200 GPUs).

Concurrency Scaling

| Concurrency | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 106 | 92 | 9.28 | 9.46 |
| 8 | 503 | 1,311 | 13.36 | 12.90 |
| 16 | 930 | 618 | 16.00 | 15.47 |
| 32 | 1,382 | 931 | 21.34 | 22.32 |
| 64 | 2,281 | 843 | 26.39 | 143.82 |
| 128 | 3,545 | 1,263 | 33.52 | 222.37 |
| 256 | 4,010 | 2,607 | 58.45 | 232.63 |
| 512 | 4,321 | 7,397 | 98.86 | 241.35 |
| 1024 | 4,370 | 58,298 | 104.21 | 243.37 |

Peak Performance

| Metric | Value |
|---|---|
| Peak sustained throughput | 4,370 tok/s (c=1024) |
| Peak burst throughput | 9,216 tok/s |
| Saturation point | ~512 concurrent |

Key Observations

  • Strong scaling to c=512 — DeepSeek V3.2 scales nearly linearly up to 256 concurrent requests and continues gaining throughput at 512. The 4,370 tok/s at c=1024 is only marginally better than c=512 (4,321 tok/s), indicating saturation around 512.
  • MLA efficiency — MLA's compressed KV cache provides 1,149,440 tokens of cache capacity (75.23 GiB per GPU), enabling high concurrency without running out of KV cache.
  • Low single-request latency — TPOT of 9.28ms at c=1 is among the lowest of all models tested, demonstrating MLA's efficient decode phase.
  • TTFT remains manageable — Even at c=256, TTFT stays under 3 seconds, making DeepSeek V3.2 suitable for interactive workloads at moderate concurrency.
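The saturation behaviour can also be read off the concurrency table by dividing aggregate throughput by concurrency to get the per-stream decode rate (figures copied from the table above):

```python
# Per-stream decode rate: total output tok/s divided by concurrency,
# using throughput figures from the benchmark table above.
throughput = {1: 106, 32: 1382, 256: 4010, 512: 4321, 1024: 4370}
per_stream = {c: round(t / c, 1) for c, t in throughput.items()}
print(per_stream)
# Each stream slows from ~106 tok/s at c=1 to ~4 tok/s at c=1024, while
# doubling concurrency past 512 adds almost no aggregate throughput.
```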

Test Endpoints

Chat Completion

```console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3-0324",
    "messages": [{"role": "user", "content": "Explain Multi-Latent Attention"}],
    "max_tokens": 256
  }'
```

Text Completion

```console
$ curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3-0324",
    "prompt": "The advantage of MLA over standard GQA is",
    "max_tokens": 128
  }'
```

Known Issues

  • vLLM 0.16.0+ recommended: While DeepSeek V3 is supported in earlier versions, vLLM 0.16.0 adds a FlashInfer MLA backend and TRT-LLM ragged prefill optimizations that significantly improve performance.
  • On-the-fly quantization: Unlike the other models in this series, which use pre-quantized checkpoints, DeepSeek V3.2 is quantized at load time via --quantization fp8. This lengthens startup but avoids the need for a separate FP8 checkpoint.
  • block-size 1 required: MLA's compressed KV format is incompatible with the default block size; if you omit --block-size 1, vLLM fails at startup with an explicit error.
  • No KV cache offloading: MLA's custom cache format doesn't support standard vLLM KV cache offloading. If you need to extend the cache beyond GPU VRAM, use Dynamo's tiered KV caching instead.
  • Large download: The base model is ~642 GB of BF16 weights, quantized to FP8 at load time.
