Qwen3-VL (235B)

Updated on 17 March, 2026

Deploy Qwen3-VL-235B-A22B-Instruct (Vision-Language model) on AMD Instinct GPUs.


Model Overview

| Property | Value |
|---|---|
| Model ID | Qwen/Qwen3-VL-235B-A22B-Instruct |
| Architecture | MoE with GQA + Vision Encoder |
| Total Parameters | 235B |
| Active Parameters | ~22B |
| Type | Vision-Language Model |
| Context Length | 256,000 tokens |

Quick Start

> **Warning:** This model does not support FP8 quantization on ROCm due to vision encoder dimension constraints. Use BF16.
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager
```
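The model takes several minutes to load (~260 seconds; see Memory Usage below), so the server will not accept requests immediately after the container starts. A small polling loop against vLLM's `/health` endpoint can confirm readiness; this is a sketch assuming the port mapping above:

```bash
# Poll the vLLM health endpoint until the model has finished loading.
# Load takes ~260 s, so allow a generous timeout (60 tries x 10 s).
for i in $(seq 1 60); do
  if curl -sf http://localhost:8000/health > /dev/null; then
    echo "Server ready"
    break
  fi
  echo "Waiting for server... (attempt $i)"
  sleep 10
done
```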

Important Configuration

**Required Environment Variable**

```bash
export VLLM_USE_TRITON_FLASH_ATTN=0
```

Vision-Language models require the Triton flash attention backend to be disabled, which this flag does.

Memory Usage

| Metric | Value |
|---|---|
| Model Memory | ~70 GB |
| Load Time | ~260 seconds |
| Max Context | 32,768 tokens |

Performance (MI325X Verified)

| Concurrency | Throughput | p99 Latency |
|---|---|---|
| 10 | 1,902 tok/s | 9.24s |
| 50 | 6,961 tok/s | 12.50s |
| 100 | 11,198 tok/s | 15.46s |
| 200 | 11,193 tok/s | 15.46s |
| 500 | 11,209 tok/s | 15.44s |

Multi-run means (n=5).

Peak Performance

  • Peak Throughput: 11,218 tok/s at 1,000 concurrent requests
  • Saturation Point: ~100 concurrent requests; throughput plateaus beyond this
  • Success Rate: 100% at all tested concurrency levels

See Qwen3-VL Stress Testing for detailed benchmark results including saturation testing up to 1,000 concurrent requests.

Test Endpoints

Text Chat

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }'
```
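Responses follow the OpenAI chat completions schema, so the reply text can be extracted with `jq` (assumed installed):

```bash
# Extract only the assistant's reply from the JSON response
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }' | jq -r '.choices[0].message.content'
```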

Vision Chat (Image URL)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=640"}}
      ]
    }],
    "max_tokens": 200
  }'
```

Vision Chat (Base64 Image)

```bash
# Encode image to base64 (-w 0 disables line wrapping; GNU coreutils)
IMAGE_BASE64=$(base64 -w 0 your_image.jpg)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'$IMAGE_BASE64'"}}
      ]
    }],
    "max_tokens": 500
  }'
```

Why FP8 Doesn't Work

The vision encoder MLP has weight dimensions that are incompatible with ROCm's FP8 kernels:

```
RuntimeError: mat2 shape (1152x538) must be divisible by 16
```

The language model portion supports FP8, but the vision encoder requires BF16. Use BF16 for the entire model.
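The constraint is easy to verify: of the failing matrix's two dimensions, only one is a multiple of 16. A quick shell check:

```bash
# 1152 = 72 * 16, but 538 = 33 * 16 + 10, hence the shape error
for dim in 1152 538; do
  if (( dim % 16 == 0 )); then
    echo "$dim is divisible by 16"
  else
    echo "$dim is NOT divisible by 16"
  fi
done
```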

Comparison with Text-Only Models

| Model | FP8 | KV Offload | Throughput (200 concurrent) |
|---|---|---|---|
| Qwen3-VL-235B | No | Yes | 11,193 tok/s |
| Llama-3.1-405B | Yes | Yes | 6,674 tok/s |
| DeepSeek V3.2 | Yes | No | 5,486 tok/s |

Qwen3-VL achieves exceptional throughput due to its MoE architecture with only 22B active parameters per token.

Troubleshooting

Image Processing Errors

Ensure VLLM_USE_TRITON_FLASH_ATTN=0 is set:

```bash
--env "VLLM_USE_TRITON_FLASH_ATTN=0"
```

FP8 Quantization Error

Do not use --quantization fp8 or --kv-cache-dtype fp8 with this model. Use BF16 (default).

OOM During Image Processing

Large images increase memory usage. Consider:

  • Resizing images before sending
  • Reducing --max-model-len
  • Using --gpu-memory-utilization 0.95
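For the first point, a one-liner with ImageMagick (assumed installed; filenames are illustrative) caps the longest side at 1024 px without upscaling smaller images:

```bash
# The '>' geometry suffix resizes only if the image exceeds 1024x1024,
# preserving aspect ratio
convert your_image.jpg -resize '1024x1024>' resized.jpg
```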
