Deploy DeepSeek-V3-0324 on NVIDIA HGX B200 GPUs. This 685B-parameter MoE model uses Multi-head Latent Attention (MLA) for compressed KV caching, delivering strong reasoning performance.
| Property | Value |
|---|---|
| Model ID | deepseek-ai/DeepSeek-V3-0324 |
| Architecture | MoE + Multi-head Latent Attention (MLA) |
| Total Parameters | 685B |
| Active Parameters | ~37B per token |
| Attention | Multi-head Latent Attention (compressed KV) |
| Routing | MoE (256 experts, 8 active per token) |
| Context Length | 128K tokens |
| Quantization | On-the-fly FP8 via `--quantization fp8` |
| License | MIT |
| Link | HuggingFace |
DeepSeek-V3-0324 uses Multi-head Latent Attention (MLA), which compresses the KV projections into a lower-dimensional latent vector before caching. This compression means the per-token KV cache is far smaller than with standard multi-head attention, and `--block-size 1` is required for vLLM's MLA cache layout. Combined with MoE routing (~37B active of 685B total parameters), DeepSeek-V3-0324 is memory-efficient for its size.
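As a back-of-the-envelope sketch of why MLA shrinks the cache, the numbers below are approximate values from the published DeepSeek-V3 config (61 layers, 128 heads, 128-dim heads, a 512-dim KV latent plus a 64-dim decoupled RoPE key); they are assumptions for illustration, not values read from vLLM:

```python
# Rough KV-cache comparison: standard MHA vs. MLA, per token.
# Config values below are assumed from the public DeepSeek-V3 config.
LAYERS = 61       # num_hidden_layers
HEADS = 128       # num_attention_heads
HEAD_DIM = 128    # per-head K and V dimension
KV_LATENT = 512   # MLA compressed KV latent dimension (kv_lora_rank)
ROPE_DIM = 64     # decoupled RoPE key dimension, also cached
BYTES = 2         # BF16

# Standard MHA caches full K and V for every head in every layer.
mha_bytes_per_token = LAYERS * HEADS * 2 * HEAD_DIM * BYTES

# MLA caches only the compressed latent (plus the small RoPE key) per layer.
mla_bytes_per_token = LAYERS * (KV_LATENT + ROPE_DIM) * BYTES

ratio = mha_bytes_per_token / mla_bytes_per_token
print(f"MHA: {mha_bytes_per_token / 1024:.0f} KiB/token")
print(f"MLA: {mla_bytes_per_token / 1024:.0f} KiB/token")
print(f"~{ratio:.0f}x smaller cache")
```

Under these assumptions the cache shrinks by roughly 50-60x per token, which is what makes 128K contexts practical at this scale.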
```bash
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code \
    --quantization fp8 \
    --block-size 1
```
Or with Docker:
```bash
$ docker run --rm --gpus all --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:v0.16.0 \
    --model deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code \
    --quantization fp8 \
    --block-size 1
```
`--block-size 1` is required for MLA models. `--quantization fp8` applies on-the-fly FP8 quantization, since the base checkpoint is BF16.
| Flag | Purpose |
|---|---|
| `--tensor-parallel-size 8` | Full 8-GPU deployment for the 685B model |
| `--max-model-len 32768` | Context window; the model supports up to 128K |
| `--gpu-memory-utilization 0.90` | Let vLLM use up to 90% of VRAM |
| `--trust-remote-code` | Required for the MLA implementation |
| `--quantization fp8` | On-the-fly FP8 quantization (no pre-quantized checkpoint) |
| `--block-size 1` | Required for the MLA KV cache format |
With TP=8 and on-the-fly FP8 quantization:
| Component | Per GPU | Total (8 GPUs) |
|---|---|---|
| Model weights | ~86 GB | ~685 GB |
| KV cache (available) | ~75 GB | ~602 GB |
| VRAM used | ~161 GB | ~1,288 GB |
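The weight figures can be sanity-checked with simple arithmetic; the sketch below assumes 1 byte per parameter under FP8 and even sharding across the 8 tensor-parallel ranks:

```python
# Rough check of the weight footprint under FP8 (1 byte per parameter),
# assuming weights shard evenly across 8 tensor-parallel GPUs.
params = 685e9
gpus = 8
bytes_per_param = 1.0  # FP8

total_gb = params * bytes_per_param / 1e9  # ~685 GB total
per_gpu_gb = total_gb / gpus               # ~86 GB per GPU
print(f"total: {total_gb:.0f} GB, per GPU: {per_gpu_gb:.1f} GB")
```

This matches the ~86 GB/GPU row above; the remaining VRAM (up to the 0.90 utilization cap) goes to KV cache and activations.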
Benchmark parameters: 2048 input tokens, 512 output tokens, random dataset. TP=8 on 8x NVIDIA HGX B200.
| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 106 | 92 | 9.28 | 9.46 |
| 8 | 503 | 1,311 | 13.36 | 12.90 |
| 16 | 930 | 618 | 16.00 | 15.47 |
| 32 | 1,382 | 931 | 21.34 | 22.32 |
| 64 | 2,281 | 843 | 26.39 | 143.82 |
| 128 | 3,545 | 1,263 | 33.52 | 222.37 |
| 256 | 4,010 | 2,607 | 58.45 | 232.63 |
| 512 | 4,321 | 7,397 | 98.86 | 241.35 |
| 1024 | 4,370 | 58,298 | 104.21 | 243.37 |
| Metric | Value |
|---|---|
| Peak sustained throughput | 4,370 tok/s (c=1024) |
| Peak burst throughput | 9,216 tok/s |
| Saturation point | ~512 concurrent |
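These metrics are linked by simple identities: per-request decode speed is 1000/TPOT tokens/s, and aggregate throughput at saturation is roughly concurrency × per-request speed, minus prefill overhead. A quick sketch using values taken from the benchmark table above:

```python
# Relate TPOT to decode speed using the c=1 row of the table above.
tpot_ms = 9.28                   # time per output token at concurrency 1
per_request = 1000 / tpot_ms     # ~108 tok/s, close to the measured 106

# At high concurrency, aggregate throughput is roughly
# concurrency * (1000 / TPOT); the gap vs. measured is prefill overhead.
c, tpot_c = 256, 58.45
estimate = c * 1000 / tpot_c     # ~4,380 tok/s vs. the measured 4,010
print(f"{per_request:.0f} tok/s per request; est. {estimate:.0f} tok/s at c={c}")
```

The estimate overshoots the measured throughput because TTFT grows sharply past the saturation point, so a growing share of each GPU-second goes to prefill rather than decode.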
```bash
$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-V3-0324",
      "messages": [{"role": "user", "content": "Explain Multi-head Latent Attention"}],
      "max_tokens": 256
    }'
```
```bash
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-V3-0324",
      "prompt": "The advantage of MLA over standard GQA is",
      "max_tokens": 128
    }'
```
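The same chat request can be made from Python with only the standard library; the helper names below are hypothetical, while the endpoint and model ID match the serve command above:

```python
import json
import urllib.request

def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    # Hypothetical helper: assemble an OpenAI-style chat request body.
    return {
        "model": "deepseek-ai/DeepSeek-V3-0324",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    # POST to the vLLM OpenAI-compatible endpoint and return the reply text.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires the server above to be running):
# print(chat("Explain Multi-head Latent Attention"))
```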
The base checkpoint is BF16, so FP8 is applied on the fly via `--quantization fp8`. This adds to startup time but avoids needing a separate FP8 checkpoint. Omitting `--block-size 1` will produce a clear error message at startup.