DeepSeek V3.2 (685B)

Updated on 17 March, 2026

Deploy DeepSeek V3.2 (685B parameters) on AMD Instinct GPUs with vLLM.


Model Overview

| Property | Value |
|---|---|
| Model ID | deepseek-ai/DeepSeek-V3.2 |
| Architecture | MoE with MLA (Multi-head Latent Attention) |
| Total Parameters | 685B |
| Context Length | 163,840 tokens |
| Download Size | ~254 GB |

Quick Start

```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8
```

Critical Configuration

Warning
These flags are mandatory for DeepSeek V3.2 on AMD:
  • --block-size 1 - Required by the MLA architecture; startup fails without it
  • AITER_ENABLE_VSKIP=0 - Prevents crashes on MI300X/MI325X
  • VLLM_ROCM_USE_AITER=1 - Enables optimized AITER kernels
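
As a preflight sanity check, the two mandatory environment flags can be verified before launching the container. This is a minimal sketch; the `REQUIRED` mapping and `check_env` helper are illustrative, not part of vLLM:

```python
# Mandatory environment flags from the warning above.
REQUIRED = {"VLLM_ROCM_USE_AITER": "1", "AITER_ENABLE_VSKIP": "0"}

def check_env(env):
    """Return any flag whose value differs from the required setting."""
    return {k: env.get(k) for k, v in REQUIRED.items() if env.get(k) != v}

# Example: AITER_ENABLE_VSKIP unset -> reported with its current value (None)
print(check_env({"VLLM_ROCM_USE_AITER": "1"}))
```

Pass `dict(os.environ)` to check the real environment; an empty result means both flags are set as required.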

Full Features Configuration

Enable chat, tool calling, and reasoning mode:

```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8 \
  --tokenizer-mode deepseek_v32 \
  --tool-call-parser deepseek_v32 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v3
```

Additional Flags Explained

| Flag | Purpose |
|---|---|
| --tokenizer-mode deepseek_v32 | Enables the chat completions endpoint |
| --tool-call-parser deepseek_v32 | Parses tool/function calls |
| --enable-auto-tool-choice | Lets the model decide when to use tools |
| --reasoning-parser deepseek_v3 | Extracts reasoning blocks from the output |

Memory Usage

| Configuration | Total Memory | Per GPU (TP=8) |
|---|---|---|
| FP8 | ~83 GB | ~10 GB |
| FP16 | ~180 GB | ~22 GB |

Startup Timing

| Metric | Value |
|---|---|
| Load Time | ~348 seconds (~5.8 minutes) |
| FP8 BMM Warmup | ~3 minutes (first run only) |

The MI325X (256 GB HBM per GPU) fits the model comfortably, leaving ample headroom for KV cache.
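
The per-GPU column above is just the total divided across the tensor-parallel ranks; the arithmetic checks out directly (helper name is illustrative):

```python
def per_gpu_memory(total_gb: float, tp_size: int) -> float:
    """Approximate weight memory per GPU under tensor parallelism."""
    return total_gb / tp_size

# FP8 weights: ~83 GB split over TP=8 -> ~10.4 GB per GPU
print(round(per_gpu_memory(83, 8), 1))
# FP16 weights: ~180 GB split over TP=8 -> ~22.5 GB per GPU
print(round(per_gpu_memory(180, 8), 1))
```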

Performance (MI325X Verified)

Concurrency Scaling

| Concurrency | Throughput | p99 Latency |
|---|---|---|
| 10 | 2,857 tok/s | 22.76 s |
| 50 | 5,694 tok/s | 23.49 s |
| 100 | 5,518 tok/s | 24.22 s |
| 200 | 5,486 tok/s | 24.14 s |
| 500 | 5,657 tok/s | 23.46 s |

All values are means over five runs (n=5).
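
Dividing aggregate throughput by concurrency gives the approximate per-request generation rate, which illustrates why aggregate throughput flattens past ~50 concurrent requests (calculation from the table above; helper name is illustrative):

```python
# (concurrency, aggregate throughput in tok/s) from the table above
results = [(10, 2857), (50, 5694), (100, 5518), (200, 5486), (500, 5657)]

def per_request_rate(concurrency: int, throughput: float) -> float:
    """Approximate tokens/s each individual request sees."""
    return throughput / concurrency

for c, t in results:
    print(c, round(per_request_rate(c, t), 1))
```

Aggregate throughput stays near 5,500-5,700 tok/s beyond 50 concurrent requests, so each additional request simply shares the same saturated budget.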

Peak Performance

  • Peak Throughput: 5,786 tok/s at 1,000 concurrent requests
  • Saturation Point: ~50 concurrent requests
  • Success Rate: 100% at all tested concurrency levels

See DeepSeek V3.2 Stress Testing for detailed results including saturation testing.

Test Endpoints

Chat Completion

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
```

Tool Calling

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }]
  }'
```
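
When the model decides to call the tool, the response carries the call under `choices[0].message.tool_calls`, with the arguments serialized as a JSON string. A minimal sketch of extracting it (the sample payload below is illustrative, not captured server output):

```python
import json

# Illustrative response shape for an OpenAI-compatible tool call;
# field names follow the chat completions API, values are made up.
response = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"San Francisco\"}",
                }
            }]
        }
    }]
}

call = response["choices"][0]["message"]["tool_calls"][0]["function"]
args = json.loads(call["arguments"])  # arguments arrive as a JSON string
print(call["name"], args["location"])
```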

Reasoning Mode

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Solve step by step: If a train travels 120km in 2 hours, and then 180km in 3 hours, what is the average speed?"}]
  }'
```
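
With --reasoning-parser deepseek_v3 enabled, vLLM separates the reasoning trace into a reasoning_content field on the message, alongside the final answer in content. A sketch of reading both (the sample message below is illustrative, not real model output):

```python
# Illustrative parsed chat message as returned with a reasoning parser
# enabled; vLLM populates reasoning_content, the values here are made up.
message = {
    "reasoning_content": "Total distance 300 km over 5 hours...",
    "content": "The average speed is 60 km/h.",
}

def split_reasoning(msg):
    """Return (reasoning, answer) from a parsed chat message."""
    return msg.get("reasoning_content"), msg.get("content")

reasoning, answer = split_reasoning(message)
print(answer)
```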

Known Limitations

KV Cache Offloading Not Supported

DeepSeek V3.2 uses MLA (Multi-head Latent Attention) which is incompatible with vLLM's KV cache offloading:

Error: KeyError: 'model.layers.0.self_attn.indexer.k_cache'

The MLA architecture uses an indexer-based KV cache that the OffloadingConnector cannot handle. Use the large HBM capacity instead (256GB per MI325X is sufficient).

FP8 KV Cache Not Supported

Do not use --kv-cache-dtype fp8:

Error: ROCMAiterMLASparseBackend doesn't support fp8 kv_cache_dtype

vLLM automatically uses the correct fp8_ds_mla format for DeepSeek models.

Troubleshooting

Crash on Startup

Ensure AITER_ENABLE_VSKIP=0 is set. If unset, the flag defaults to enabled, which causes crashes on MI300X/MI325X.

Chat Endpoint Returns Error

Add --tokenizer-mode deepseek_v32 to enable the chat completions endpoint.

Long Warmup Time

FP8 BMM kernel pre-compilation takes ~3 minutes on first run. Subsequent starts use cached kernels.
