Concurrency Tuning

Updated on 17 March, 2026

Maximize throughput by tuning vLLM for high concurrent request loads.


Key Parameters

max_num_seqs

Maximum concurrent sequences in a single batch.

Value       Use Case
128-256     Memory-constrained, large models
512-1024    Default, balanced performance
2048-4096   Maximum throughput, smaller models
bash
--max-num-seqs 2048

max_num_batched_tokens

Maximum total tokens across all sequences in a batch.

Value         Trade-off
2048          Better inter-token latency
8192-16384    Balanced for interactive use
32768-65536   Maximum throughput
bash
--max-num-batched-tokens 32768
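For intuition on what the budget buys: with the 2,048-token prompts used in the benchmarks below, a 32,768-token budget lets the scheduler prefill up to 16 full prompts in a single batch (a simplification that ignores decode tokens sharing the same budget):

```shell
# Prompts that fit in one prefill batch = budget / prompt length.
echo $(( 32768 / 2048 ))   # → 16
```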

gpu_memory_utilization

Fraction of GPU memory vLLM pre-allocates in total (model weights, activations, and KV cache); whatever remains after loading weights becomes KV cache.

Value   Use Case
0.45    Multiple instances per GPU
0.90    Default, single instance
0.95    Maximum capacity
bash
--gpu-memory-utilization 0.95

Throughput Results (MI325X Verified)

All results below are multi-run means (n=5) using input=2048, output=512 tokens.

Qwen3-VL-235B (BF16)

Concurrency   Throughput     p99 Latency
10            1,902 tok/s    9.24s
50            6,961 tok/s    12.50s
100           11,198 tok/s   15.46s
200           11,193 tok/s   15.46s
500           11,209 tok/s   15.44s
750           11,208 tok/s   15.44s
1,000         11,218 tok/s   15.43s

Llama-3.1-405B (FP8)

Concurrency   Throughput    p99 Latency
10            1,090 tok/s   16.17s
50            4,381 tok/s   20.14s
100           6,802 tok/s   25.84s
200           6,674 tok/s   26.33s
500           6,804 tok/s   25.84s
750           6,808 tok/s   25.83s
1,000         6,798 tok/s   25.84s

DeepSeek V3.2 (685B, FP8+AITER)

Concurrency   Throughput    p99 Latency
10            2,857 tok/s   22.76s
50            5,694 tok/s   23.49s
100           5,518 tok/s   24.22s
200           5,486 tok/s   24.14s
500           5,657 tok/s   23.46s
750           5,550 tok/s   23.96s
1,000         5,786 tok/s   23.01s

Kimi-K2.5 (1T, INT4 QAT, TP=4)

Note
Kimi-K2.5 runs with AITER disabled and TP=4 (not TP=8) due to MLA head count constraints.

Concurrency   Throughput   p99 Latency
10            225 tok/s    77.83s
50            583 tok/s    149.39s
100           948 tok/s    183.35s
200           950 tok/s    182.96s
500           948 tok/s    183.23s
750           947 tok/s    183.54s
1,000         952 tok/s    182.52s

Peak Throughput & Saturation

Model              Peak Throughput   Saturation Point   p99 @ Peak
Qwen3-VL-235B      11,218 tok/s      ~100 concurrent    15.43s
Llama-3.1-405B     6,808 tok/s       ~100 concurrent    25.83s
DeepSeek V3.2      5,786 tok/s       ~50 concurrent     23.01s
Kimi-K2.5 (TP=4)   952 tok/s         ~100 concurrent    182.52s

Multi-run means (n=5). 100% success rate at all concurrency levels up to 1,000.

Key Findings

  • Throughput saturates early: all four models reach peak throughput by roughly 100 concurrent requests; pushing concurrency higher adds no throughput
  • FP8 quantization makes very large models such as Llama-405B and DeepSeek V3.2 servable on a single 8x MI325X node
  • Architecture matters: MoE models with few active parameters (Qwen3: 22B active of 235B total) out-throughput dense models (Llama: all 405B active)
  • Active parameter count predicts throughput better than total parameter count
  • 100% request success rate at every tested concurrency level (up to 1,000) across all models

Cross-Model Analysis

Metric               Qwen3-VL-235B   Llama-3.1-405B   DeepSeek V3.2   Kimi-K2.5 (TP=4)
Peak per GPU         1,402 tok/s     851 tok/s        723 tok/s       238 tok/s*
Scaling (10→200)     5.9x            6.1x             1.9x            4.2x
p99 latency @ peak   15.43s          25.83s           23.01s          182.52s

*Kimi-K2.5 uses only 4 GPUs (TP=4), so per-GPU throughput is calculated as 952/4.
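The per-GPU figures are simply peak aggregate throughput divided by the tensor-parallel size, e.g.:

```shell
# Per-GPU throughput = peak aggregate throughput / tensor-parallel size.
echo $(( 11218 / 8 ))   # Qwen3-VL-235B at TP=8 → 1402 tok/s per GPU
echo $(( 952 / 4 ))     # Kimi-K2.5 at TP=4 → 238 tok/s per GPU
```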

Key insights:

  • MoE advantage: Qwen's sparse activation (22B of 235B) delivers the highest per-GPU throughput despite being run in BF16
  • Dense models scale more linearly: Llama achieves 6.1x scaling (10→200) vs DeepSeek's 1.9x
  • MLA latency cost: DeepSeek V3.2 and Kimi-K2.5 (both MLA) show higher p99 latency than GQA models
  • Kimi-K2.5 constraint: Limited to TP=4 due to AITER MLA head count requirements, using only half the GPU cluster

Concurrency Recommendations

Use Case          Concurrency   Throughput      p99 Latency
Low latency       1-10          ~1,900 tok/s    ~9s
Balanced          50            ~7,000 tok/s    ~12s
High throughput   100+          ~11,200 tok/s   ~15s

Values based on Qwen3-VL-235B multi-run means (n=5, input=2048, output=512). Other models follow similar saturation patterns — see per-model tables above.

Configuration Profiles

High-Throughput Configuration

bash
# Environment
export VLLM_ROCM_USE_AITER=1
export NCCL_MIN_NCHANNELS=112

# Serving
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=1" \
  --env "NCCL_MIN_NCHANNELS=112" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --tensor-parallel-size 8 \
  --max-num-seqs 2048 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.95 \
  --num-scheduler-steps 15

Low-Latency Configuration

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_ROCM_USE_AITER=1" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --tensor-parallel-size 4 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.90

Chunked Prefill

In vLLM V1, chunked prefill is always enabled and cannot be disabled. Tune performance via --max-num-batched-tokens:

Value            Effect
2048 (default)   Better inter-token latency
8192-16384       Balanced TTFT/ITL
32768+           Better TTFT, higher throughput
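As a rough illustration of the mechanics: a prompt longer than the token budget is prefilled over roughly ceil(prompt_len / budget) scheduler steps (a simplification that ignores decode tokens sharing the same budget; values below are hypothetical):

```shell
PROMPT_LEN=8192   # hypothetical long prompt
BUDGET=2048       # default max-num-batched-tokens
# Ceiling division: number of prefill chunks for this prompt.
echo $(( (PROMPT_LEN + BUDGET - 1) / BUDGET ))   # → 4
```

Raising the budget shrinks the number of chunks per prompt, which is why larger values improve time-to-first-token at the cost of inter-token latency.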

Multi-Step Scheduling

Reduce GPU idle time by batching scheduler steps:

bash
--num-scheduler-steps 15

Value   Benefit
1       Default, maximum flexibility
10-15   Reduced CPU overhead
20+     Diminishing returns

Test Matrix

max_num_seqs   max_num_batched_tokens   Expected Behavior
256            8192                     Low memory, moderate throughput
512            16384                    Balanced
1024           32768                    High throughput
2048           65536                    Maximum throughput
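One way to walk the matrix is to generate the serving flags for each row and launch runs one at a time; the sketch below only prints the flag combinations rather than starting servers:

```shell
# Print the flag combination for each row of the test matrix.
for cfg in "256 8192" "512 16384" "1024 32768" "2048 65536"; do
    set -- $cfg   # $1 = max_num_seqs, $2 = max_num_batched_tokens
    echo "--max-num-seqs $1 --max-num-batched-tokens $2"
done
```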

Benchmarking Script

Test concurrency scaling:

bash
#!/bin/bash
MODEL="Qwen/Qwen3-VL-235B-A22B-Instruct"
MAX_TOKENS=100

for CONCURRENT in 10 25 50 100 200; do
    REQUESTS=$((CONCURRENT * 2))

    echo "Testing: $CONCURRENT concurrent, $REQUESTS total"

    START=$(date +%s.%N)

    for i in $(seq 1 $REQUESTS); do
        curl -s -o /dev/null http://localhost:8000/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}], \"max_tokens\": $MAX_TOKENS}" &

        # Throttle: keep at most $CONCURRENT requests in flight
        while [ "$(jobs -r | wc -l)" -ge "$CONCURRENT" ]; do
            sleep 0.05
        done
    done
    wait

    END=$(date +%s.%N)
    ELAPSED=$(echo "$END - $START" | bc)
    TOKS=$(echo "$REQUESTS * $MAX_TOKENS / $ELAPSED" | bc)

    echo "Completed in ${ELAPSED}s (~${TOKS} output tok/s)"
done
