Qwen3-VL (235B) Validation Testing

Updated on 11 March, 2026

Validation benchmarks for Qwen3-VL-235B-A22B-Instruct (Vision-Language model, 235B parameters, 22B active) on 8x AMD Instinct MI325X GPUs.


Concurrency Scaling

Scaling Results

| Concurrent | Throughput | Output tok/s | p99 Latency | Status |
|---|---|---|---|---|
| 5 | 1,915 tok/s | 286 | 3.50s | DEGRADED |
| 10 | 3,577 tok/s | 534 | 3.74s | DEGRADED |
| 25 | 7,824 tok/s | 1,167 | 4.28s | DEGRADED |
| 50 | 13,136 tok/s | 1,959 | 5.08s | DEGRADED |
| 75 | 13,677 tok/s | 2,040 | 4.88s | DEGRADED |
| 100 | 13,132 tok/s | 1,958 | 5.07s | DEGRADED |

Observations:

  • Near-linear throughput scaling from 5 to 50 concurrent requests, with gains flattening between 50 and 75
  • Peak sweep throughput of 13,677 tok/s at 75 concurrent requests
  • The MoE architecture (22B active of 235B total parameters) enables efficient batching
  • DEGRADED status indicates p99 latency above 2x the single-request baseline, which is expected under concurrent load
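The scaling behavior above can be quantified directly from the table. A minimal sketch (figures are the sweep numbers above; per-request throughput at 5 concurrent serves as the baseline):

```python
# Aggregate throughput (tok/s) at each concurrency level, from the table above.
sweep = {5: 1915, 10: 3577, 25: 7824, 50: 13136, 75: 13677, 100: 13132}

# Per-request throughput at the lowest concurrency is the scaling baseline.
baseline = sweep[5] / 5  # 383 tok/s per request

for conc, total in sweep.items():
    per_request = total / conc
    efficiency = per_request / baseline  # 1.0 = perfectly linear scaling
    print(f"{conc:>4} concurrent: {per_request:6.1f} tok/s/request, "
          f"{efficiency:.0%} scaling efficiency")
```

At 50 concurrent the sweep retains roughly 69% of the per-request baseline throughput; by 100 concurrent this falls to about 34%, consistent with the plateau in aggregate throughput.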

Stress Tests

| Test Type | Mode | Throughput | Output tok/s | p99 Latency | Status |
|---|---|---|---|---|---|
| long_output | text | 995 tok/s | 839 | 8.93s | OK |
| long_context | text | 5,249 tok/s | 403 | 3.47s | OK |
| multi_image_3 | multi-image | 4,270 tok/s | 354 | 5.93s | OK |
| high_conc_vision | vision | 9,546 tok/s | 1,987 | 7.52s | OK |

Key findings:

  • Long output generation: 995 tok/s with 8.93s p99 latency
  • Long context (4K tokens): 5,249 tok/s with 3.47s p99 latency
  • Multi-image (3 images): 4,270 tok/s with 5.93s p99 latency
  • High concurrency vision (100 concurrent): 9,546 tok/s
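For reference, a three-image request like the multi_image_3 case can be issued through the OpenAI-compatible chat endpoint that vLLM serves. A minimal sketch of the request body (the image URLs and prompt are illustrative placeholders, not the benchmark's actual inputs):

```python
import json

# OpenAI-style vision message: a content list mixing image_url and text parts.
# The URLs below are placeholders, not the benchmark's test images.
image_urls = [
    "https://example.com/img1.jpg",
    "https://example.com/img2.jpg",
    "https://example.com/img3.jpg",
]

payload = {
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            *[{"type": "image_url", "image_url": {"url": u}} for u in image_urls],
            {"type": "text", "text": "Compare these three images."},
        ],
    }],
    "max_tokens": 256,
}

# POST this JSON to http://<host>:8000/v1/chat/completions with any HTTP client.
print(json.dumps(payload, indent=2))
```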

Saturation Testing

Extreme Load Results

| Concurrent | Throughput | Success Rate | p99 Latency | Status |
|---|---|---|---|---|
| 150 | 17,810 tok/s | 100% | 5.61s | OK |
| 200 | 17,553 tok/s | 100% | 5.70s | SATURATED |
| 300 | 17,569 tok/s | 100% | 5.69s | SATURATED |
| 500 | 17,707 tok/s | 100% | 5.66s | SATURATED |

Observations:

  • Peak throughput of 17,810 tok/s achieved at 150 concurrent
  • Saturation begins at 200 concurrent requests
  • 100% success rate maintained even under extreme load
  • Consistent ~17,500 tok/s across 150-500 concurrent
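The flat-throughput claim can be checked directly from the table. A quick sketch computing the spread across the 150–500 concurrent range:

```python
# Aggregate throughput (tok/s) under extreme load, from the table above.
saturation = {150: 17810, 200: 17553, 300: 17569, 500: 17707}

spread = max(saturation.values()) - min(saturation.values())
relative = spread / max(saturation.values())
print(f"spread: {spread} tok/s ({relative:.1%} of peak)")
```

A spread of 257 tok/s (about 1.4% of peak) across a 3.3x increase in offered concurrency is the signature of a fully saturated but stable serving stack.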

Recommendations

| Use Case | Concurrency | Expected Throughput |
|---|---|---|
| Low latency | 5–10 | 1,900–3,500 tok/s |
| Balanced | 25–50 | 7,800–13,100 tok/s |
| High throughput | 75–150 | 13,600–17,800 tok/s |

Test Configuration

| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-VL-235B-A22B-Instruct |
| Test Mode | quick |
| Timestamp | 20260128_190627 |
| Vision Model | Yes |

Launch Command

```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager
```
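Once the container is up, the endpoint can be smoke-tested with a short completion request. A minimal sketch using only the Python standard library (host and port match the `-p 8000:8000` mapping above; this is an illustration, not part of the validation harness):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the -p 8000:8000 mapping above

payload = {
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{"role": "user", "content": "Reply with the word: ready"}],
    "max_tokens": 8,
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running; loading a 235B model takes a while.
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print("request prepared for", req.full_url)
```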

Test Environment

| Specification | Value |
|---|---|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | 0.14.1 |
