Qwen3-VL (235B)

Updated on 17 March, 2026

Deploy Qwen3-VL-235B-A22B-Instruct (Vision-Language model) on AMD Instinct GPUs.


Model Overview

| Property | Value |
|---|---|
| Model ID | Qwen/Qwen3-VL-235B-A22B-Instruct |
| Architecture | MoE with GQA + Vision Encoder |
| Total Parameters | 235B |
| Active Parameters | ~22B |
| Type | Vision-Language Model |
| Context Length | 256,000 tokens |

Quick Start

> **Warning:** This model does not support FP8 quantization on ROCm due to vision encoder dimension constraints. Use BF16.
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager
```
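The model takes several minutes to load (~260 seconds; see Memory Usage below), so the server will not accept requests immediately after the container starts. A small polling loop against vLLM's `/health` endpoint can confirm readiness; this is a sketch assuming the port mapping above:

```bash
# Poll the vLLM health endpoint until the model has finished loading.
# Load takes ~260 s, so allow a generous timeout (60 tries x 10 s).
for i in $(seq 1 60); do
  if curl -sf http://localhost:8000/health > /dev/null; then
    echo "Server ready"
    break
  fi
  echo "Waiting for server... (attempt $i)"
  sleep 10
done
```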

Important Configuration

**Required Environment Variable**

```bash
export VLLM_USE_TRITON_FLASH_ATTN=0
```

Vision-Language models require the Triton flash attention backend to be disabled, which this flag does.

Memory Usage

| Metric | Value |
|---|---|
| Model Memory | ~70 GB |
| Load Time | ~260 seconds |
| Max Context | 32,768 tokens |

Performance (MI325X Verified)

| Concurrency | Throughput | p99 Latency |
|---|---|---|
| 10 | 1,902 tok/s | 9.24s |
| 50 | 6,961 tok/s | 12.50s |
| 100 | 11,198 tok/s | 15.46s |
| 200 | 11,193 tok/s | 15.46s |
| 500 | 11,209 tok/s | 15.44s |

Multi-run means (n=5).

Peak Performance

  • Peak Throughput: 11,218 tok/s at 1,000 concurrent requests
  • Saturation Point: ~100 concurrent requests; throughput plateaus beyond this
  • Success Rate: 100% at all tested concurrency levels

See Qwen3-VL Stress Testing for detailed benchmark results including saturation testing up to 1,000 concurrent requests.

Test Endpoints

Text Chat

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }'
```
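Responses follow the OpenAI chat completions schema, so the reply text can be extracted with `jq` (assumed installed):

```bash
# Extract only the assistant's reply from the JSON response
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }' | jq -r '.choices[0].message.content'
```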

Vision Chat (Image URL)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=640"}}
      ]
    }],
    "max_tokens": 200
  }'
```

Vision Chat (Base64 Image)

```bash
# Encode image to base64 (-w 0 disables line wrapping; GNU coreutils)
IMAGE_BASE64=$(base64 -w 0 your_image.jpg)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'$IMAGE_BASE64'"}}
      ]
    }],
    "max_tokens": 500
  }'
```

Why FP8 Doesn't Work

The vision encoder MLP has weight dimensions that are incompatible with ROCm's FP8 kernels:

```
RuntimeError: mat2 shape (1152x538) must be divisible by 16
```

The language model portion supports FP8, but the vision encoder requires BF16. Use BF16 for the entire model.
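The constraint is easy to verify: of the failing matrix's two dimensions, only one is a multiple of 16. A quick shell check:

```bash
# 1152 = 72 * 16, but 538 = 33 * 16 + 10, hence the shape error
for dim in 1152 538; do
  if (( dim % 16 == 0 )); then
    echo "$dim is divisible by 16"
  else
    echo "$dim is NOT divisible by 16"
  fi
done
```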

Comparison with Text-Only Models

| Model | FP8 | KV Offload | Throughput (200 concurrent) |
|---|---|---|---|
| Qwen3-VL-235B | No | Yes | 11,193 tok/s |
| Llama-3.1-405B | Yes | Yes | 6,674 tok/s |
| DeepSeek V3.2 | Yes | No | 5,486 tok/s |

Qwen3-VL achieves exceptional throughput due to its MoE architecture with only 22B active parameters per token.

Troubleshooting

Image Processing Errors

Ensure VLLM_USE_TRITON_FLASH_ATTN=0 is set:

```bash
--env "VLLM_USE_TRITON_FLASH_ATTN=0"
```

FP8 Quantization Error

Do not use --quantization fp8 or --kv-cache-dtype fp8 with this model. Use BF16 (default).

OOM During Image Processing

Large images increase memory usage. Consider:

  • Resizing images before sending
  • Reducing --max-model-len
  • Using --gpu-memory-utilization 0.95
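For the first point, a one-liner with ImageMagick (assumed installed; filenames are illustrative) caps the longest side at 1024 px without upscaling smaller images:

```bash
# The '>' geometry suffix resizes only if the image exceeds 1024x1024,
# preserving aspect ratio
convert your_image.jpg -resize '1024x1024>' resized.jpg
```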
