Deploy DeepSeek V3.2 (685B parameters) on AMD Instinct GPUs.
| Property | Value |
|---|---|
| Model ID | deepseek-ai/DeepSeek-V3.2 |
| Architecture | MoE with MLA (Multi-head Latent Attention) |
| Total Parameters | 685B |
| Context Length | 163,840 tokens |
| Download Size | ~254 GB |
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8
```
| Flag | Purpose |
|---|---|
| `--block-size 1` | Required for the MLA architecture (will error without it) |
| `AITER_ENABLE_VSKIP=0` | Prevents crashes on MI300X/MI325X |
| `VLLM_ROCM_USE_AITER=1` | Enables optimized kernels |

Enable chat, tool calling, and reasoning mode:
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "AITER_ENABLE_VSKIP=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --block-size 1 \
  --quantization fp8 \
  --tokenizer-mode deepseek_v32 \
  --tool-call-parser deepseek_v32 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v3
```
| Flag | Purpose |
|---|---|
| `--tokenizer-mode deepseek_v32` | Enables the chat completions endpoint |
| `--tool-call-parser deepseek_v32` | Parses tool/function calls |
| `--enable-auto-tool-choice` | Allows the model to decide when to use tools |
| `--reasoning-parser deepseek_v3` | Extracts reasoning blocks |
| Configuration | Total Memory | Per GPU (TP=8) |
|---|---|---|
| FP8 | ~83 GB | ~10 GB |
| FP16 | ~180 GB | ~22 GB |
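The per-GPU column is simply the total split across the eight tensor-parallel ranks; a minimal sketch of that arithmetic, assuming an even split:

```python
def per_gpu_weights(total_gb: float, tp_size: int = 8) -> float:
    """Memory per rank, assuming weights are split evenly across
    tensor-parallel GPUs (an even split is an approximation)."""
    return total_gb / tp_size

print(per_gpu_weights(83))   # FP8 total  -> roughly 10 GB per GPU
print(per_gpu_weights(180))  # FP16 total -> roughly 22 GB per GPU
```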
| Metric | Value |
|---|---|
| Load Time | ~348 seconds (~5.8 minutes) |
| FP8 BMM Warmup | ~3 minutes (first run only) |
MI325X (256GB) easily fits the model with room for KV cache.
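Given the multi-minute load time, it helps to poll the server until it answers before sending traffic. A stdlib-only sketch, assuming the default port mapping from the commands above:

```python
import json
import time
import urllib.error
import urllib.request

def parse_model_ids(body: bytes) -> list:
    """Extract model IDs from a /v1/models response body."""
    return [m["id"] for m in json.loads(body)["data"]]

def wait_for_server(base_url: str = "http://localhost:8000",
                    timeout_s: int = 600) -> list:
    """Poll /v1/models until the server responds; model load plus FP8 BMM
    warmup can take several minutes on the first run."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
                return parse_model_ids(resp.read())
        except (urllib.error.URLError, ConnectionError):
            time.sleep(5)
    raise TimeoutError("vLLM server did not become ready in time")
```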
| Concurrent | Throughput | p99 Latency |
|---|---|---|
| 10 | 2,857 tok/s | 22.76s |
| 50 | 5,694 tok/s | 23.49s |
| 100 | 5,518 tok/s | 24.22s |
| 200 | 5,486 tok/s | 24.14s |
| 500 | 5,657 tok/s | 23.46s |
Figures are means across multiple runs (n=5).
See DeepSeek V3.2 Stress Testing for detailed results including saturation testing.
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Hello, who are you?"}]
  }'
```
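The same request can be issued from Python using only the standard library; the URL assumes the server started above is listening on localhost:8000:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # port mapping from the docker command

def build_payload(prompt: str) -> dict:
    """OpenAI-style chat completion payload for the deployed model."""
    return {
        "model": "deepseek-ai/DeepSeek-V3.2",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str) -> str:
    """Send one chat turn and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```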
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }]
  }'
```
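With `--tool-call-parser deepseek_v32` and `--enable-auto-tool-choice`, tool invocations come back in the standard OpenAI `tool_calls` field of the assistant message, with `arguments` as a JSON-encoded string. A small sketch for unpacking them:

```python
import json

def extract_tool_calls(message: dict) -> list:
    """Return (function_name, parsed_arguments) pairs from an
    OpenAI-style assistant message; 'arguments' arrives as a JSON string."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls
```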
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Solve step by step: If a train travels 120km in 2 hours, and then 180km in 3 hours, what is the average speed?"}]
  }'
```
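When `--reasoning-parser deepseek_v3` is enabled, vLLM separates the extracted reasoning into a `reasoning_content` field alongside the usual `content` in the assistant message. A minimal sketch for splitting the two:

```python
def split_reasoning(message: dict) -> tuple:
    """Separate the reasoning trace from the final answer in a message
    returned by a vLLM server running with a reasoning parser enabled."""
    return (message.get("reasoning_content") or "",
            message.get("content") or "")
```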
DeepSeek V3.2 uses MLA (Multi-head Latent Attention) which is incompatible with vLLM's KV cache offloading:
```
Error: KeyError: 'model.layers.0.self_attn.indexer.k_cache'
```

The MLA architecture uses an indexer-based KV cache that the OffloadingConnector cannot handle. Use the large HBM capacity instead (256 GB per MI325X is sufficient).
Do not use `--kv-cache-dtype fp8`:

```
Error: ROCMAiterMLASparseBackend doesn't support fp8 kv_cache_dtype
```

vLLM automatically uses the correct `fp8_ds_mla` format for DeepSeek models.
Ensure `AITER_ENABLE_VSKIP=0` is set. If unset, it defaults to true, which causes crashes on MI300X/MI325X.
Add `--tokenizer-mode deepseek_v32` to enable the chat completions endpoint.
FP8 BMM kernel pre-compilation takes ~3 minutes on first run. Subsequent starts use cached kernels.