Qwen3-VL (235B) Validation Testing

Updated on 11 March, 2026

Validation benchmarks for Qwen3-VL-235B-A22B-Instruct (Vision-Language model, 235B parameters, 22B active) on 8x AMD Instinct MI325X GPUs.


Concurrency Scaling

Scaling Results

| Concurrent | Throughput | Output tok/s | p99 Latency | Status |
|---|---|---|---|---|
| 5 | 1,915 tok/s | 286 | 3.50s | DEGRADED |
| 10 | 3,577 tok/s | 534 | 3.74s | DEGRADED |
| 25 | 7,824 tok/s | 1,167 | 4.28s | DEGRADED |
| 50 | 13,136 tok/s | 1,959 | 5.08s | DEGRADED |
| 75 | 13,677 tok/s | 2,040 | 4.88s | DEGRADED |
| 100 | 13,132 tok/s | 1,958 | 5.07s | DEGRADED |

Observations:

  • Near-linear throughput scaling from 5 to 50 concurrent requests, with gains flattening between 50 and 75
  • Peak sweep throughput of 13,677 tok/s at 75 concurrent requests
  • The MoE architecture (22B active of 235B total parameters) enables efficient batching
  • DEGRADED status indicates p99 latency above 2x the single-request baseline, which is expected under concurrent load
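The scaling behavior above can be quantified directly from the table. A minimal sketch (figures are the sweep numbers above; per-request throughput at 5 concurrent serves as the baseline):

```python
# Aggregate throughput (tok/s) at each concurrency level, from the table above.
sweep = {5: 1915, 10: 3577, 25: 7824, 50: 13136, 75: 13677, 100: 13132}

# Per-request throughput at the lowest concurrency is the scaling baseline.
baseline = sweep[5] / 5  # 383 tok/s per request

for conc, total in sweep.items():
    per_request = total / conc
    efficiency = per_request / baseline  # 1.0 = perfectly linear scaling
    print(f"{conc:>4} concurrent: {per_request:6.1f} tok/s/request, "
          f"{efficiency:.0%} scaling efficiency")
```

At 50 concurrent the sweep retains roughly 69% of the per-request baseline throughput; by 100 concurrent this falls to about 34%, consistent with the plateau in aggregate throughput.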

Stress Tests

| Test Type | Mode | Throughput | Output tok/s | p99 Latency | Status |
|---|---|---|---|---|---|
| long_output | text | 995 tok/s | 839 | 8.93s | OK |
| long_context | text | 5,249 tok/s | 403 | 3.47s | OK |
| multi_image_3 | multi-image | 4,270 tok/s | 354 | 5.93s | OK |
| high_conc_vision | vision | 9,546 tok/s | 1,987 | 7.52s | OK |

Key findings:

  • Long output generation: 995 tok/s with 8.93s p99 latency
  • Long context (4K tokens): 5,249 tok/s with 3.47s p99 latency
  • Multi-image (3 images): 4,270 tok/s with 5.93s p99 latency
  • High concurrency vision (100 concurrent): 9,546 tok/s
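For reference, a three-image request like the multi_image_3 case can be issued through the OpenAI-compatible chat endpoint that vLLM serves. A minimal sketch of the request body (the image URLs and prompt are illustrative placeholders, not the benchmark's actual inputs):

```python
import json

# OpenAI-style vision message: a content list mixing image_url and text parts.
# The URLs below are placeholders, not the benchmark's test images.
image_urls = [
    "https://example.com/img1.jpg",
    "https://example.com/img2.jpg",
    "https://example.com/img3.jpg",
]

payload = {
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            *[{"type": "image_url", "image_url": {"url": u}} for u in image_urls],
            {"type": "text", "text": "Compare these three images."},
        ],
    }],
    "max_tokens": 256,
}

# POST this JSON to http://<host>:8000/v1/chat/completions with any HTTP client.
print(json.dumps(payload, indent=2))
```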

Saturation Testing

Extreme Load Results

| Concurrent | Throughput | Success Rate | p99 Latency | Status |
|---|---|---|---|---|
| 150 | 17,810 tok/s | 100% | 5.61s | OK |
| 200 | 17,553 tok/s | 100% | 5.70s | SATURATED |
| 300 | 17,569 tok/s | 100% | 5.69s | SATURATED |
| 500 | 17,707 tok/s | 100% | 5.66s | SATURATED |

Observations:

  • Peak throughput of 17,810 tok/s achieved at 150 concurrent
  • Saturation begins at 200 concurrent requests
  • 100% success rate maintained even under extreme load
  • Consistent ~17,500 tok/s across 150-500 concurrent
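The flat-throughput claim can be checked directly from the table. A quick sketch computing the spread across the 150–500 concurrent range:

```python
# Aggregate throughput (tok/s) under extreme load, from the table above.
saturation = {150: 17810, 200: 17553, 300: 17569, 500: 17707}

spread = max(saturation.values()) - min(saturation.values())
relative = spread / max(saturation.values())
print(f"spread: {spread} tok/s ({relative:.1%} of peak)")
```

A spread of 257 tok/s (about 1.4% of peak) across a 3.3x increase in offered concurrency is the signature of a fully saturated but stable serving stack.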

Recommendations

| Use Case | Concurrency | Expected Throughput |
|---|---|---|
| Low latency | 5–10 | 1,900–3,500 tok/s |
| Balanced | 25–50 | 7,800–13,100 tok/s |
| High throughput | 75–150 | 13,600–17,800 tok/s |

Test Configuration

| Parameter | Value |
|---|---|
| Model | Qwen/Qwen3-VL-235B-A22B-Instruct |
| Test Mode | quick |
| Timestamp | 20260128_190627 |
| Vision Model | Yes |

Launch Command

```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager
```
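Once the container is up, the endpoint can be smoke-tested with a short completion request. A minimal sketch using only the Python standard library (host and port match the `-p 8000:8000` mapping above; this is an illustration, not part of the validation harness):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the -p 8000:8000 mapping above

payload = {
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [{"role": "user", "content": "Reply with the word: ready"}],
    "max_tokens": 8,
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running; loading a 235B model takes a while.
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print("request prepared for", req.full_url)
```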

Test Environment

| Specification | Value |
|---|---|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | 0.14.1 |
