Maximize throughput by tuning vLLM for high concurrent request loads.
## Key Parameters

### `max_num_seqs`

Maximum number of sequences processed concurrently in a single batch.

| Value | Use Case |
|---|---|
| 128-256 | Memory-constrained, large models |
| 512-1024 | Default, balanced performance |
| 2048-4096 | Maximum throughput, smaller models |

```bash
--max-num-seqs 2048
```
### `max_num_batched_tokens`

Maximum total number of tokens across all sequences in a batch.

| Value | Trade-off |
|---|---|
| 2048 | Better inter-token latency |
| 8192-16384 | Balanced for interactive use |
| 32768-65536 | Maximum throughput |

```bash
--max-num-batched-tokens 32768
```
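To make the trade-off concrete: with the 2048-token prompts used in the benchmarks below, the token budget bounds how many full prefills can be scheduled in a single step (a back-of-envelope sketch, not vLLM's actual scheduler logic):

```python
# Rough upper bound on full-prompt prefills per scheduler step.
# Assumes every request has a 2048-token prompt, as in the benchmarks below.
input_len = 2048

for budget in (2048, 8192, 32768):
    prefills = budget // input_len
    print(f"max_num_batched_tokens={budget}: up to {prefills} full prefill(s) per step")
```

Larger budgets let more prefills share a step (higher throughput), but each step takes longer, which is why the smaller settings give better inter-token latency.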
### `gpu_memory_utilization`

Fraction of total GPU memory the vLLM instance may use; whatever remains after model weights and activations is allocated to the KV cache.

| Value | Use Case |
|---|---|
| 0.45 | Multiple instances per GPU |
| 0.90 | Default, single instance |
| 0.95 | Maximum capacity |

```bash
--gpu-memory-utilization 0.95
```
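Combining the three flags, a maximum-throughput launch might look like this (`<model>` is a placeholder; the exact model path and any parallelism flags depend on your deployment):

```shell
# Illustrative high-throughput settings; lower them on memory-constrained GPUs.
vllm serve <model> \
  --max-num-seqs 2048 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.95
```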
## Throughput Results (MI325X Verified)

All results below are multi-run means (n=5) with input=2048 and output=512 tokens.
### Qwen3-VL-235B (BF16)

| Concurrent | Throughput | p99 Latency |
|---|---|---|
| 10 | 1,902 tok/s | 9.24s |
| 50 | 6,961 tok/s | 12.50s |
| 100 | 11,198 tok/s | 15.46s |
| 200 | 11,193 tok/s | 15.46s |
| 500 | 11,209 tok/s | 15.44s |
| 750 | 11,208 tok/s | 15.44s |
| 1,000 | 11,218 tok/s | 15.43s |
### Llama-3.1-405B (FP8)

| Concurrent | Throughput | p99 Latency |
|---|---|---|
| 10 | 1,090 tok/s | 16.17s |
| 50 | 4,381 tok/s | 20.14s |
| 100 | 6,802 tok/s | 25.84s |
| 200 | 6,674 tok/s | 26.33s |
| 500 | 6,804 tok/s | 25.84s |
| 750 | 6,808 tok/s | 25.83s |
| 1,000 | 6,798 tok/s | 25.84s |
### DeepSeek V3.2 (685B, FP8+AITER)

| Concurrent | Throughput | p99 Latency |
|---|---|---|
| 10 | 2,857 tok/s | 22.76s |
| 50 | 5,694 tok/s | 23.49s |
| 100 | 5,518 tok/s | 24.22s |
| 200 | 5,486 tok/s | 24.14s |
| 500 | 5,657 tok/s | 23.46s |
| 750 | 5,550 tok/s | 23.96s |
| 1,000 | 5,786 tok/s | 23.01s |
### Kimi-K2.5 (1T, INT4 QAT, TP=4)

> **Note:** Kimi-K2.5 runs with AITER disabled and TP=4 (not TP=8) due to MLA head count constraints.

| Concurrent | Throughput | p99 Latency |
|---|---|---|
| 10 | 225 tok/s | 77.83s |
| 50 | 583 tok/s | 149.39s |
| 100 | 948 tok/s | 183.35s |
| 200 | 950 tok/s | 182.96s |
| 500 | 948 tok/s | 183.23s |
| 750 | 947 tok/s | 183.54s |
| 1,000 | 952 tok/s | 182.52s |
## Peak Throughput & Saturation

| Model | Peak Throughput | Saturation Point | p99 @ Peak |
|---|---|---|---|
| Qwen3-VL-235B | 11,218 tok/s | ~100 concurrent | 15.43s |
| Llama-3.1-405B | 6,808 tok/s | ~100 concurrent | 25.83s |
| DeepSeek V3.2 | 5,786 tok/s | ~50 concurrent | 23.01s |
| Kimi-K2.5 (TP=4) | 952 tok/s | ~100 concurrent | 182.52s |

Multi-run means (n=5). 100% success rate at all concurrency levels up to 1,000.
## Key Findings

- **Throughput saturates early:** all models reach peak throughput by 100 concurrent requests; additional concurrency provides no further benefit.
- **FP8 enables large models:** FP8 quantization fits Llama-405B and DeepSeek V3.2 on 8x MI325X.
- **Architecture matters:** MoE models with fewer active parameters (Qwen3: 22B active) achieve higher throughput than dense models (Llama: 405B).
- **Active parameters predict throughput** better than total parameters.
- **100% reliability** at all tested concurrency levels (up to 1,000) across all models.
## Cross-Model Analysis

| Metric | Qwen3-VL-235B | Llama-3.1-405B | DeepSeek V3.2 | Kimi-K2.5 (TP=4) |
|---|---|---|---|---|
| Peak per GPU | 1,402 tok/s | 851 tok/s | 723 tok/s | 238 tok/s* |
| Scaling (10→200) | 5.9x | 6.1x | 1.9x | 4.2x |
| p99 latency @ peak | 15.43s | 25.83s | 23.01s | 182.52s |

*Kimi-K2.5 uses only 4 GPUs (TP=4), so per-GPU throughput is calculated as 952/4.
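The per-GPU row is just peak aggregate throughput divided by the tensor-parallel GPU count; the arithmetic, using the peak figures from the tables above:

```python
# (peak aggregate tok/s, number of GPUs used) per model, from the tables above
peaks = {
    "Qwen3-VL-235B": (11_218, 8),
    "Llama-3.1-405B": (6_808, 8),
    "DeepSeek V3.2": (5_786, 8),
    "Kimi-K2.5": (952, 4),  # TP=4: only half of the 8-GPU node
}

# Per-GPU throughput = aggregate throughput / GPU count, rounded to whole tok/s
per_gpu = {name: round(tps / gpus) for name, (tps, gpus) in peaks.items()}
print(per_gpu)
# {'Qwen3-VL-235B': 1402, 'Llama-3.1-405B': 851, 'DeepSeek V3.2': 723, 'Kimi-K2.5': 238}
```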
**Key insights:**

- **MoE advantage:** Qwen's sparse activation (22B of 235B parameters) delivers the highest per-GPU throughput despite running in BF16.
- **Dense models scale more linearly:** Llama achieves 6.1x scaling (10→200 concurrent) vs DeepSeek's 1.9x.
- **MLA latency cost:** DeepSeek V3.2 and Kimi-K2.5 (both MLA-based) show higher p99 latency than the GQA models.
- **Kimi-K2.5 constraint:** limited to TP=4 by AITER MLA head count requirements, so it uses only half the GPU cluster.
## Concurrency Recommendations

| Use Case | Concurrency | Throughput | p99 Latency |
|---|---|---|---|
| Low latency | 1-10 | ~1,900 tok/s | ~9s |
| Balanced | 50 | ~7,000 tok/s | ~12s |
| High throughput | 100+ | ~11,200 tok/s | ~15s |

Values based on Qwen3-VL-235B multi-run means (n=5, input=2048, output=512). Other models follow similar saturation patterns; see the per-model tables above.
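As a sketch of how these recommendations could be applied in client code (the function and its thresholds are illustrative, derived from the Qwen3-VL-235B numbers above, not part of vLLM):

```python
def recommend_concurrency(p99_budget_s: float) -> int:
    """Map a p99 latency budget (seconds) to a client concurrency level.

    Thresholds follow the Qwen3-VL-235B results: ~9s p99 at concurrency 10,
    ~12s at 50, and ~15s at 100+, where throughput saturates.
    """
    if p99_budget_s < 12:
        return 10   # low-latency regime, ~1,900 tok/s
    if p99_budget_s < 15:
        return 50   # balanced, ~7,000 tok/s
    return 100      # saturated throughput, ~11,200 tok/s

print(recommend_concurrency(9.5))   # -> 10
print(recommend_concurrency(20.0))  # -> 100
```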