Deploy DeepSeek V3.2 (685B parameters) on AMD Instinct GPUs.
| Property | Value |
|---|---|
| Model ID | deepseek-ai/DeepSeek-V3.2 |
| Architecture | MoE with MLA (Multi-head Latent Attention) |
| Total Parameters | 685B |
| Context Length | 163,840 tokens |
| Download Size | ~254 GB |
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8
```
| Flag | Purpose |
|---|---|
| `--block-size 1` | Required for the MLA architecture (will error without it) |
| `AITER_ENABLE_VSKIP=0` | Prevents crashes on MI300X/MI325X |
| `VLLM_ROCM_USE_AITER=1` | Enables optimized kernels |

Enable chat, tool calling, and reasoning mode:
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8 \
  --tokenizer-mode deepseek_v32 \
  --tool-call-parser deepseek_v32 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v3
```
| Flag | Purpose |
|---|---|
| `--tokenizer-mode deepseek_v32` | Enables the chat completions endpoint |
| `--tool-call-parser deepseek_v32` | Parses tool/function calls |
| `--enable-auto-tool-choice` | Allows the model to decide when to use tools |
| `--reasoning-parser deepseek_v3` | Extracts reasoning blocks |
| Configuration | Total Memory | Per GPU (TP=8) |
|---|---|---|
| FP8 | ~83 GB | ~10 GB |
| FP16 | ~180 GB | ~22 GB |
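The per-GPU column is simply the total split across the eight tensor-parallel ranks; a minimal sketch of that arithmetic, assuming an even split:

```python
def per_gpu_weights(total_gb: float, tp_size: int = 8) -> float:
    """Memory per rank, assuming weights are split evenly across
    tensor-parallel GPUs (an even split is an approximation)."""
    return total_gb / tp_size

print(per_gpu_weights(83))   # FP8 total  -> roughly 10 GB per GPU
print(per_gpu_weights(180))  # FP16 total -> roughly 22 GB per GPU
```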
| Metric | Value |
|---|---|
| Load Time | ~348 seconds (~5.8 minutes) |
| FP8 BMM Warmup | ~3 minutes (first run only) |
MI325X (256GB) easily fits the model with room for KV cache.
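Given the multi-minute load time, it helps to poll the server until it answers before sending traffic. A stdlib-only sketch, assuming the default port mapping from the commands above:

```python
import json
import time
import urllib.error
import urllib.request

def parse_model_ids(body: bytes) -> list:
    """Extract model IDs from a /v1/models response body."""
    return [m["id"] for m in json.loads(body)["data"]]

def wait_for_server(base_url: str = "http://localhost:8000",
                    timeout_s: int = 600) -> list:
    """Poll /v1/models until the server responds; model load plus FP8 BMM
    warmup can take several minutes on the first run."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
                return parse_model_ids(resp.read())
        except (urllib.error.URLError, ConnectionError):
            time.sleep(5)
    raise TimeoutError("vLLM server did not become ready in time")
```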
| Concurrent | Throughput | p99 Latency |
|---|---|---|
| 10 | 2,857 tok/s | 22.76s |
| 50 | 5,694 tok/s | 23.49s |
| 100 | 5,518 tok/s | 24.22s |
| 200 | 5,486 tok/s | 24.14s |
| 500 | 5,657 tok/s | 23.46s |
Figures are means across multiple runs (n=5).
See DeepSeek V3.2 Stress Testing for detailed results including saturation testing.
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
```
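The same request can be issued from Python using only the standard library; the URL assumes the server started above is listening on localhost:8000:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # port mapping from the docker command

def build_payload(prompt: str) -> dict:
    """OpenAI-style chat completion payload for the deployed model."""
    return {
        "model": "deepseek-ai/DeepSeek-V3.2",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str) -> str:
    """Send one chat turn and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```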
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }]
  }'
```
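With `--tool-call-parser deepseek_v32` and `--enable-auto-tool-choice`, tool invocations come back in the standard OpenAI `tool_calls` field of the assistant message, with `arguments` as a JSON-encoded string. A small sketch for unpacking them:

```python
import json

def extract_tool_calls(message: dict) -> list:
    """Return (function_name, parsed_arguments) pairs from an
    OpenAI-style assistant message; 'arguments' arrives as a JSON string."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls
```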
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Solve step by step: If a train travels 120km in 2 hours, and then 180km in 3 hours, what is the average speed?"}]
  }'
```
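When `--reasoning-parser deepseek_v3` is enabled, vLLM separates the extracted reasoning into a `reasoning_content` field alongside the usual `content` in the assistant message. A minimal sketch for splitting the two:

```python
def split_reasoning(message: dict) -> tuple:
    """Separate the reasoning trace from the final answer in a message
    returned by a vLLM server running with a reasoning parser enabled."""
    return (message.get("reasoning_content") or "",
            message.get("content") or "")
```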
DeepSeek V3.2 uses MLA (Multi-head Latent Attention) which is incompatible with vLLM's KV cache offloading:
```
Error: KeyError: 'model.layers.0.self_attn.indexer.k_cache'
```

The MLA architecture uses an indexer-based KV cache that the OffloadingConnector cannot handle. Use the large HBM capacity instead (256 GB per MI325X is sufficient).
Do not use `--kv-cache-dtype fp8`:

```
Error: ROCMAiterMLASparseBackend doesn't support fp8 kv_cache_dtype
```

vLLM automatically uses the correct `fp8_ds_mla` format for DeepSeek models.
Ensure `AITER_ENABLE_VSKIP=0` is set. If unset, it defaults to true, which causes crashes on MI300X/MI325X.
Add `--tokenizer-mode deepseek_v32` to enable the chat completions endpoint.
FP8 BMM kernel pre-compilation takes ~3 minutes on first run. Subsequent starts use cached kernels.