Deploy Qwen3-VL-235B-A22B-Instruct (Vision-Language model) on AMD Instinct GPUs.
| Property | Value |
|---|---|
| Model ID | Qwen/Qwen3-VL-235B-A22B-Instruct |
| Architecture | MoE with GQA + Vision Encoder |
| Total Parameters | 235B |
| Active Parameters | ~22B |
| Type | Vision-Language Model |
| Context Length | 256,000 tokens |
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager
```
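Loading the model takes several minutes (see the metrics below), so it helps to poll vLLM's `/health` endpoint before sending requests. A minimal sketch; the function name, default URL, and timeout are illustrative:

```shell
# Poll the vLLM health endpoint until the server is ready or the
# timeout expires. Model load alone takes on the order of minutes.
wait_for_server() {
  base_url="${1:-http://localhost:8000}"
  timeout="${2:-600}"   # seconds
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if curl -sf "${base_url}/health" >/dev/null 2>&1; then
      echo "server ready after ${elapsed}s"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "server not ready after ${timeout}s" >&2
  return 1
}
```

Example: `wait_for_server http://localhost:8000 900 && echo "ready to serve"`.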
[WARNING] Required Environment Variable

```bash
export VLLM_USE_TRITON_FLASH_ATTN=0
```

Vision-Language models require this flag to be disabled.
| Metric | Value |
|---|---|
| Model Memory | ~70 GB |
| Load Time | ~260 seconds |
| Max Context | 32,768 tokens |
| Concurrent Requests | Throughput | p99 Latency |
|---|---|---|
| 10 | 1,902 tok/s | 9.24s |
| 50 | 6,961 tok/s | 12.50s |
| 100 | 11,198 tok/s | 15.46s |
| 200 | 11,193 tok/s | 15.46s |
| 500 | 11,209 tok/s | 15.44s |
All values are multi-run means (n=5).
See Qwen3-VL Stress Testing for detailed benchmark results including saturation testing up to 1,000 concurrent requests.
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }'
```
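The request body can also be built in a shell variable first, which avoids hand-escaping quotes when the prompt comes from shell input. A sketch; the `PROMPT` and `PAYLOAD` names are illustrative, and a prompt containing double quotes or backslashes would still need JSON escaping before substitution:

```shell
# Build the chat payload once, then reuse it with curl.
PROMPT="Hello, how are you?"
PAYLOAD=$(cat <<EOF
{
  "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
  "messages": [{"role": "user", "content": "$PROMPT"}],
  "max_tokens": 100
}
EOF
)
```

Send it with `curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "$PAYLOAD"`.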
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=640"}}
      ]
    }],
    "max_tokens": 200
  }'
```
```bash
# Encode image to base64
IMAGE_BASE64=$(base64 -w 0 your_image.jpg)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'$IMAGE_BASE64'"}}
      ]
    }],
    "max_tokens": 500
  }'
```
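Note that `base64 -w 0` is GNU coreutils syntax; the BSD/macOS `base64` has no `-w` flag. Piping through `tr -d '\n'` produces a single line with either implementation. A small sketch; the helper name is illustrative:

```shell
# Portable single-line base64: strip newlines instead of relying on
# the GNU-only `-w 0`, so the output can go into a JSON data URI.
encode_image() {
  base64 < "$1" | tr -d '\n'
}
```

Then `IMAGE_BASE64=$(encode_image your_image.jpg)` works on both Linux and macOS before substituting into the data URI as above.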
The vision encoder MLP has dimensions that are not compatible with ROCm's FP8 kernels:

```
RuntimeError: mat2 shape (1152x538) must be divisible by 16
```

The language model portion supports FP8, but the vision encoder requires BF16. Use BF16 for the entire model.
| Model | FP8 | KV Offload | Throughput (200 concurrent) |
|---|---|---|---|
| Qwen3-VL-235B | No | Yes | 11,193 tok/s |
| Llama-3.1-405B | Yes | Yes | 6,674 tok/s |
| DeepSeek V3.2 | Yes | No | 5,486 tok/s |
Qwen3-VL achieves exceptional throughput due to its MoE architecture with only 22B active parameters per token.
Ensure `VLLM_USE_TRITON_FLASH_ATTN=0` is set:

```bash
--env "VLLM_USE_TRITON_FLASH_ATTN=0"
```
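A host-side preflight check can catch a missing flag before the container launches. A sketch; the function name is illustrative, and it assumes the flag is forwarded from the host environment (e.g., via `--env "VLLM_USE_TRITON_FLASH_ATTN=$VLLM_USE_TRITON_FLASH_ATTN"`) rather than hard-coded in the run command:

```shell
# Fail fast if the Triton flash-attention flag is not disabled in
# the host environment before launching the container.
check_vlm_env() {
  if [ "${VLLM_USE_TRITON_FLASH_ATTN:-}" != "0" ]; then
    echo "error: export VLLM_USE_TRITON_FLASH_ATTN=0 first" >&2
    return 1
  fi
}
```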
Do not use `--quantization fp8` or `--kv-cache-dtype fp8` with this model. Use BF16 (default).
Large images increase memory usage. Consider:

- Reducing `--max-model-len`
- Increasing headroom with `--gpu-memory-utilization 0.95`