Concurrency Tuning

Updated on 17 March, 2026

Maximize throughput by tuning vLLM for high concurrent request loads.


Key Parameters

max_num_seqs

Maximum concurrent sequences in a single batch.

Value       Use Case
128-256     Memory-constrained, large models
512-1024    Default, balanced performance
2048-4096   Maximum throughput, smaller models
bash
--max-num-seqs 2048

max_num_batched_tokens

Maximum total tokens across all sequences in a batch.

Value         Trade-off
2048          Better inter-token latency
8192-16384    Balanced for interactive use
32768-65536   Maximum throughput
bash
--max-num-batched-tokens 32768
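For intuition on what the budget buys: with the 2,048-token prompts used in the benchmarks below, a 32,768-token budget lets the scheduler prefill up to 16 full prompts in a single batch (a simplification that ignores decode tokens sharing the same budget):

```shell
# Prompts that fit in one prefill batch = budget / prompt length.
echo $(( 32768 / 2048 ))   # → 16
```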

gpu_memory_utilization

Fraction of GPU memory vLLM pre-allocates in total (model weights, activations, and KV cache); whatever remains after loading weights becomes KV cache.

Value   Use Case
0.45    Multiple instances per GPU
0.90    Default, single instance
0.95    Maximum capacity
bash
--gpu-memory-utilization 0.95

Throughput Results (MI325X Verified)

All results below are multi-run means (n=5) using input=2048, output=512 tokens.

Qwen3-VL-235B (BF16)

Concurrency   Throughput     p99 Latency
10            1,902 tok/s    9.24s
50            6,961 tok/s    12.50s
100           11,198 tok/s   15.46s
200           11,193 tok/s   15.46s
500           11,209 tok/s   15.44s
750           11,208 tok/s   15.44s
1,000         11,218 tok/s   15.43s

Llama-3.1-405B (FP8)

Concurrency   Throughput    p99 Latency
10            1,090 tok/s   16.17s
50            4,381 tok/s   20.14s
100           6,802 tok/s   25.84s
200           6,674 tok/s   26.33s
500           6,804 tok/s   25.84s
750           6,808 tok/s   25.83s
1,000         6,798 tok/s   25.84s

DeepSeek V3.2 (685B, FP8+AITER)

Concurrency   Throughput    p99 Latency
10            2,857 tok/s   22.76s
50            5,694 tok/s   23.49s
100           5,518 tok/s   24.22s
200           5,486 tok/s   24.14s
500           5,657 tok/s   23.46s
750           5,550 tok/s   23.96s
1,000         5,786 tok/s   23.01s

Kimi-K2.5 (1T, INT4 QAT, TP=4)

Note
Kimi-K2.5 runs with AITER disabled and TP=4 (not TP=8) due to MLA head count constraints.

Concurrency   Throughput   p99 Latency
10            225 tok/s    77.83s
50            583 tok/s    149.39s
100           948 tok/s    183.35s
200           950 tok/s    182.96s
500           948 tok/s    183.23s
750           947 tok/s    183.54s
1,000         952 tok/s    182.52s

Peak Throughput & Saturation

Model              Peak Throughput   Saturation Point   p99 @ Peak
Qwen3-VL-235B      11,218 tok/s      ~100 concurrent    15.43s
Llama-3.1-405B     6,808 tok/s       ~100 concurrent    25.83s
DeepSeek V3.2      5,786 tok/s       ~50 concurrent     23.01s
Kimi-K2.5 (TP=4)   952 tok/s         ~100 concurrent    182.52s

Multi-run means (n=5). 100% success rate at all concurrency levels up to 1,000.

Key Findings

  • Throughput saturates early: all four models reach peak throughput by roughly 100 concurrent requests; pushing concurrency higher adds no throughput
  • FP8 quantization makes very large models such as Llama-405B and DeepSeek V3.2 servable on a single 8x MI325X node
  • Architecture matters: MoE models with few active parameters (Qwen3: 22B active of 235B total) out-throughput dense models (Llama: all 405B active)
  • Active parameter count predicts throughput better than total parameter count
  • 100% request success rate at every tested concurrency level (up to 1,000) across all models

Cross-Model Analysis

Metric               Qwen3-VL-235B   Llama-3.1-405B   DeepSeek V3.2   Kimi-K2.5 (TP=4)
Peak per GPU         1,402 tok/s     851 tok/s        723 tok/s       238 tok/s*
Scaling (10→200)     5.9x            6.1x             1.9x            4.2x
p99 latency @ peak   15.43s          25.83s           23.01s          182.52s

*Kimi-K2.5 uses only 4 GPUs (TP=4), so per-GPU throughput is calculated as 952/4.
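The per-GPU figures are simply peak aggregate throughput divided by the tensor-parallel size, e.g.:

```shell
# Per-GPU throughput = peak aggregate throughput / tensor-parallel size.
echo $(( 11218 / 8 ))   # Qwen3-VL-235B at TP=8 → 1402 tok/s per GPU
echo $(( 952 / 4 ))     # Kimi-K2.5 at TP=4 → 238 tok/s per GPU
```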

Key insights:

  • MoE advantage: Qwen's sparse activation (22B of 235B) delivers the highest per-GPU throughput despite being run in BF16
  • Dense models scale more linearly: Llama achieves 6.1x scaling (10→200) vs DeepSeek's 1.9x
  • MLA latency cost: DeepSeek V3.2 and Kimi-K2.5 (both MLA) show higher p99 latency than GQA models
  • Kimi-K2.5 constraint: Limited to TP=4 due to AITER MLA head count requirements, using only half the GPU cluster

Concurrency Recommendations

Use Case          Concurrency   Throughput      p99 Latency
Low latency       1-10          ~1,900 tok/s    ~9s
Balanced          50            ~7,000 tok/s    ~12s
High throughput   100+          ~11,200 tok/s   ~15s

Values based on Qwen3-VL-235B multi-run means (n=5, input=2048, output=512). Other models follow similar saturation patterns — see per-model tables above.

Configuration Profiles

High-Throughput Configuration

bash
# Environment
export VLLM_ROCM_USE_AITER=1
export NCCL_MIN_NCHANNELS=112

# Serving
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "NCCL_MIN_NCHANNELS=112" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --tensor-parallel-size 8 \
  --max-num-seqs 2048 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.95 \
  --num-scheduler-steps 15

Low-Latency Configuration

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=1" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --tensor-parallel-size 4 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.90

Chunked Prefill

In vLLM V1, chunked prefill is always enabled and cannot be disabled. Tune performance via --max-num-batched-tokens:

Value            Effect
2048 (default)   Better inter-token latency
8192-16384       Balanced TTFT/ITL
32768+           Better TTFT, higher throughput
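As a rough illustration of the mechanics: a prompt longer than the token budget is prefilled over roughly ceil(prompt_len / budget) scheduler steps (a simplification that ignores decode tokens sharing the same budget; values below are hypothetical):

```shell
PROMPT_LEN=8192   # hypothetical long prompt
BUDGET=2048       # default max-num-batched-tokens
# Ceiling division: number of prefill chunks for this prompt.
echo $(( (PROMPT_LEN + BUDGET - 1) / BUDGET ))   # → 4
```

Raising the budget shrinks the number of chunks per prompt, which is why larger values improve time-to-first-token at the cost of inter-token latency.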

Multi-Step Scheduling

Reduce GPU idle time by batching scheduler steps:

bash
--num-scheduler-steps 15

Value   Benefit
1       Default, maximum flexibility
10-15   Reduced CPU overhead
20+     Diminishing returns

Test Matrix

max_num_seqs   max_num_batched_tokens   Expected Behavior
256            8192                     Low memory, moderate throughput
512            16384                    Balanced
1024           32768                    High throughput
2048           65536                    Maximum throughput
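One way to walk the matrix is to generate the serving flags for each row and launch runs one at a time; the sketch below only prints the flag combinations rather than starting servers:

```shell
# Print the flag combination for each row of the test matrix.
for cfg in "256 8192" "512 16384" "1024 32768" "2048 65536"; do
    set -- $cfg   # $1 = max_num_seqs, $2 = max_num_batched_tokens
    echo "--max-num-seqs $1 --max-num-batched-tokens $2"
done
```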

Benchmarking Script

Test concurrency scaling:

bash
#!/bin/bash
MODEL="Qwen/Qwen3-VL-235B-A22B-Instruct"
MAX_TOKENS=100

for CONCURRENT in 10 25 50 100 200; do
    REQUESTS=$((CONCURRENT * 2))

    echo "Testing: $CONCURRENT concurrent, $REQUESTS total"

    START=$(date +%s.%N)

    for i in $(seq 1 $REQUESTS); do
        curl -s -o /dev/null http://localhost:8000/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}], \"max_tokens\": $MAX_TOKENS}" &

        # Throttle: keep at most $CONCURRENT requests in flight
        while [ "$(jobs -r | wc -l)" -ge "$CONCURRENT" ]; do
            sleep 0.05
        done
    done
    wait

    END=$(date +%s.%N)
    ELAPSED=$(echo "$END - $START" | bc)
    TOKS=$(echo "$REQUESTS * $MAX_TOKENS / $ELAPSED" | bc)

    echo "Completed in ${ELAPSED}s (~${TOKS} output tok/s)"
done
