Concurrency Tuning

Updated on 11 March, 2026

Maximize throughput by tuning vLLM for concurrent request loads on NVIDIA HGX B200 GPUs.


Key Parameters

max_num_seqs

Maximum concurrent sequences in a single batch.

| Value | Use Case |
|---|---|
| 128-256 | Memory-constrained, large models (GLM-5, DeepSeek V3.2) |
| 512-1024 | Default, balanced performance |
| 2048-4096 | Maximum throughput, smaller models (Nemotron Nano) |

`--max-num-seqs 2048`

max_num_batched_tokens

Maximum total tokens across all sequences in a batch.

| Value | Trade-off |
|---|---|
| 2048 | Better inter-token latency |
| 8192-16384 | Balanced for interactive use |
| 32768-65536 | Maximum throughput |

`--max-num-batched-tokens 32768`

gpu_memory_utilization

Fraction of GPU memory allocated for model weights and KV cache.

| Value | Use Case |
|---|---|
| 0.45 | Multiple instances per GPU |
| 0.90 | Default, single instance |
| 0.95 | Maximum capacity |

`--gpu-memory-utilization 0.95`
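As a rough mental model of what this flag controls (a back-of-envelope sketch with illustrative numbers, not vLLM's internal memory accounting), the KV cache budget is approximately total GPU memory times the utilization fraction, minus model weights and runtime overhead:

```python
# Back-of-envelope KV cache budget. All figures below are hypothetical
# examples, not measured values from this guide.
def kv_cache_budget_gb(total_mem_gb: float,
                       gpu_memory_utilization: float,
                       weights_gb: float,
                       overhead_gb: float = 2.0) -> float:
    """Approximate memory left for KV cache after weights and overhead."""
    return total_mem_gb * gpu_memory_utilization - weights_gb - overhead_gb

# Hypothetical: a 180 GB GPU serving ~30 GB of FP8 weights at 0.90 utilization.
budget = kv_cache_budget_gb(180, 0.90, 30)
print(f"~{budget:.0f} GB available for KV cache")
```

Raising utilization from 0.90 to 0.95 therefore buys KV cache capacity (more concurrent sequences) at the cost of headroom for allocation spikes.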

Throughput Results (NVIDIA HGX B200 Verified)

All results use input=2048, output=512 tokens, random dataset.

Nemotron Nano 30B (FP8, TP=2, 2 GPUs)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 277 | 362 | 2.91 | 3.14 |
| 8 | 1,881 | 152 | 3.96 | 4.28 |
| 16 | 3,455 | 160 | 4.32 | 5.19 |
| 32 | 3,798 | 206 | 7.86 | 40.72 |
| 64 | 8,120 | 324 | 6.74 | 43.82 |
| 128 | 11,556 | 520 | 8.91 | 45.19 |
| 256 | 15,552 | 1,254 | 11.77 | 48.20 |
| 512 | 18,746 | 2,610 | 15.46 | 47.88 |
| 1024 | 18,829 | 2,563 | 15.46 | 47.75 |

Saturation: ~512 concurrent. Peak: 18,829 tok/s sustained on 2 GPUs.

Nemotron Super 49B (FP8, TP=1, 1 GPU)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 73 | 170 | 13.40 | 14.44 |
| 8 | 570 | 71 | 12.34 | 13.65 |
| 16 | 1,054 | 117 | 11.17 | 13.66 |
| 32 | 2,120 | 172 | 11.57 | 12.09 |
| 64 | 3,816 | 205 | 12.70 | 13.86 |
| 128 | 3,799 | 222 | 12.70 | 13.43 |
| 256 | 1,586 | 5,551 | 20.56 | 192.31 |
| 512 | 3,775 | 271 | 12.69 | 13.66 |
| 1024 | 1,587 | 5,544 | 20.56 | 189.99 |

Saturation: ~64 concurrent. Peak: 3,816 tok/s on 1 GPU. Throughput oscillates at c=256/1024 due to TP=1 batch scheduling effects.

MiniMax M2.5 (FP8, TP=4, 4 GPUs)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 87 | 84 | 11.40 | 12.75 |
| 8 | 581 | 217 | 13.37 | 13.72 |
| 16 | 1,057 | 77 | 15.01 | 15.85 |
| 32 | 1,636 | 91 | 18.82 | 22.04 |
| 64 | 2,591 | 143 | 22.71 | 27.37 |
| 128 | 3,943 | 220 | 27.90 | 31.98 |
| 256 | 5,945 | 393 | 34.05 | 41.89 |
| 512 | 8,822 | 686 | 43.79 | 52.39 |
| 1024 | 8,838 | 687 | 43.73 | 52.25 |

Saturation: ~512 concurrent. Peak: 8,838 tok/s sustained on 4 GPUs.

GLM-5 744B (FP8, TP=8, 8 GPUs)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 54 | 183 | 18.31 | 18.45 |
| 8 | 317 | 1,249 | 22.85 | 22.20 |
| 16 | 606 | 788 | 24.89 | 24.94 |
| 32 | 884 | 1,341 | 33.56 | 31.86 |
| 64 | 1,456 | 2,031 | 39.94 | 474.93 |
| 128 | 2,132 | 3,558 | 52.93 | 569.76 |
| 256 | 2,071 | 6,207 | 111.12 | 580.49 |
| 512 | 1,955 | 57,202 | 139.19 | 584.22 |
| 1024 | 1,944 | 175,241 | 143.10 | 584.03 |

Saturation: ~128 concurrent. Peak: 2,132 tok/s sustained on 8 GPUs.

DeepSeek V3.2 685B (FP8, TP=8, 8 GPUs)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 106 | 92 | 9.28 | 9.46 |
| 8 | 503 | 1,311 | 13.36 | 12.90 |
| 16 | 930 | 618 | 16.00 | 15.47 |
| 32 | 1,382 | 931 | 21.34 | 22.32 |
| 64 | 2,281 | 843 | 26.39 | 143.82 |
| 128 | 3,545 | 1,263 | 33.52 | 222.37 |
| 256 | 4,010 | 2,607 | 58.45 | 232.63 |
| 512 | 4,321 | 7,397 | 98.86 | 241.35 |
| 1024 | 4,370 | 58,298 | 104.21 | 243.37 |

Saturation: ~512 concurrent. Peak: 4,370 tok/s sustained on 8 GPUs.

Peak Throughput and Saturation

| Model | GPUs | Peak tok/s | tok/s/GPU | Saturation Point |
|---|---|---|---|---|
| Nemotron Nano 30B | 2 | 18,829 | 9,415 | ~512 concurrent |
| Nemotron Super 49B | 1 | 3,816 | 3,816 | ~64 concurrent |
| MiniMax M2.5 229B | 4 | 8,838 | 2,210 | ~512 concurrent |
| GLM-5 744B | 8 | 2,132 | 267 | ~128 concurrent |
| DeepSeek V3.2 685B | 8 | 4,370 | 546 | ~512 concurrent |

Key Findings

  • Active parameters predict per-GPU throughput: Nemotron Nano (3B active) achieves 9,415 tok/s/GPU; MiniMax M2.5 (10B active) achieves 2,210 tok/s/GPU; DeepSeek V3.2 (37B active) achieves 546 tok/s/GPU. Fewer active parameters = higher per-GPU throughput.
  • Saturation varies by KV cache pressure: Models with efficient KV caching (MLA, Mamba) saturate at ~512 concurrent. GLM-5 saturates earlier at ~128 due to higher KV cache consumption per token.
  • MoE efficiency: Four of five models leverage MoE to activate only a fraction of total parameters per token, enabling high throughput despite large total parameter counts. Nemotron Super 49B is the exception: a dense transformer where all 49B parameters are active on every token.
  • TTFT grows with concurrency: roughly linear below saturation, then sharply worse beyond it as prefill requests queue behind active decode operations (GLM-5's TTFT jumps from 6.2 s at c=256 to 175 s at c=1024).
  • DeepSeek V3.2 scales best among TP=8 models: MLA's compressed KV cache provides 1.15M tokens capacity vs GLM-5's 691K, enabling 2x the sustained throughput at high concurrency.
  • Zero failures: 100% success rate at all concurrency levels up to 1,024 for all five models.
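The per-GPU column in the summary table is simple division of peak throughput by GPU count; a quick sketch reproducing those figures:

```python
# Reproduce the tok/s/GPU column from the peak-throughput table above.
peaks = {
    "Nemotron Nano 30B":  (18829, 2),
    "Nemotron Super 49B": (3816, 1),
    "MiniMax M2.5 229B":  (8838, 4),
    "GLM-5 744B":         (2132, 8),
    "DeepSeek V3.2 685B": (4370, 8),
}

# Round half-up to match the table's figures.
per_gpu = {m: int(tok / gpus + 0.5) for m, (tok, gpus) in peaks.items()}
for model, rate in sorted(per_gpu.items(), key=lambda kv: -kv[1]):
    print(f"{model:<20} {rate:>6} tok/s/GPU")
```

Sorting by this metric puts the models in order of active parameter count, which is the first finding above.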

TTFT vs TPOT Trade-Offs

Interactive applications should optimize for TTFT (time to first token): the user-perceived latency before streaming begins. Batch processing should optimize for throughput (tok/s).

| Use Case | Target TTFT | Recommended Concurrency | Expected Throughput |
|---|---|---|---|
| Real-time chat | < 200 ms | 1-16 | Model dependent |
| Interactive API | < 500 ms | 16-64 | 50-70% of peak |
| Batch processing | Don't care | 256-512 | ~100% of peak |
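A TTFT threshold table like this can be derived mechanically from sweep data: keep only concurrency levels whose measured TTFT meets the SLO, then take the highest. A sketch using the Nemotron Nano sweep numbers from above:

```python
# (concurrency, mean TTFT ms, output tok/s) from the Nemotron Nano sweep above.
SWEEP = [
    (1, 362, 277), (8, 152, 1881), (16, 160, 3455), (32, 206, 3798),
    (64, 324, 8120), (128, 520, 11556), (256, 1254, 15552),
    (512, 2610, 18746), (1024, 2563, 18829),
]

def max_concurrency_for_ttft(sweep, ttft_slo_ms):
    """Highest measured concurrency whose mean TTFT meets the SLO."""
    ok = [(c, tput) for c, ttft, tput in sweep if ttft < ttft_slo_ms]
    return max(ok) if ok else None

print(max_concurrency_for_ttft(SWEEP, 500))   # (64, 8120)
```

Because TTFT is not strictly monotonic in concurrency (compare c=512 and c=1024 above), filtering all levels is more robust than stopping at the first violation.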

Nemotron Nano TTFT Thresholds

| TTFT Target | Max Concurrency | Throughput Achieved | % of Peak |
|---|---|---|---|
| < 200 ms | 16 | 3,455 tok/s | 18% |
| < 350 ms | 64 | 8,120 tok/s | 43% |
| < 600 ms | 128 | 11,556 tok/s | 61% |
| < 1,500 ms | 256 | 15,552 tok/s | 83% |
| < 3,000 ms | 512 | 18,746 tok/s | 100% |
| Unconstrained | 1024+ | 18,829 tok/s | 100% |

MiniMax M2.5 TTFT Thresholds

| TTFT Target | Max Concurrency | Throughput Achieved | % of Peak |
|---|---|---|---|
| < 100 ms | 32 | 1,636 tok/s | 19% |
| < 250 ms | 128 | 3,943 tok/s | 45% |
| < 500 ms | 256 | 5,945 tok/s | 67% |
| < 1,000 ms | 1024 | 8,838 tok/s | 100% |

Goodput Benchmarks (Nemotron Nano FP8)

Goodput measures requests per second that meet all latency SLOs simultaneously. Tested with TTFT < 500ms and TPOT < 50ms.

| Concurrency | Goodput (req/s) | Output tok/s | Mean TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | P99 TPOT (ms) |
|---|---|---|---|---|---|---|
| 1 | 0.54 | 281 | 363 | 7,998 | 2.85 | 2.91 |
| 32 | 6.10 | 3,331 | 688 | 8,637 | 8.10 | 23.80 |
| 64 | 13.29 | 8,027 | 349 | 841 | 6.79 | 7.35 |
| 128 | 6.20 | 11,343 | 727 | 1,621 | 8.72 | 9.83 |
| 256 | 3.65 | 15,237 | 1,377 | 3,254 | 11.78 | 14.23 |
| 512 | 2.08 | 18,535 | 2,614 | 5,050 | 15.70 | 19.27 |

Peak goodput at c=64 (13.29 req/s). Beyond this, TTFT exceeds 500ms and most requests violate the SLO. TPOT stays well under the 50ms SLO at all concurrency levels: TTFT is the binding constraint.
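Goodput follows directly from per-request results: count only requests that satisfy every SLO simultaneously, then divide by the wall-clock duration. A minimal sketch (the record field names here are illustrative, not the benchmark tool's schema):

```python
def goodput(requests, duration_s, ttft_slo_ms=500.0, tpot_slo_ms=50.0):
    """Requests per second that met BOTH the TTFT and TPOT SLOs."""
    good = sum(1 for r in requests
               if r["ttft_ms"] < ttft_slo_ms and r["tpot_ms"] < tpot_slo_ms)
    return good / duration_s

# Toy data: 3 of 4 requests meet both SLOs over a 10 s window -> 0.3 req/s.
reqs = [
    {"ttft_ms": 349, "tpot_ms": 6.8},
    {"ttft_ms": 841, "tpot_ms": 7.4},   # violates the 500 ms TTFT SLO
    {"ttft_ms": 120, "tpot_ms": 4.1},
    {"ttft_ms": 420, "tpot_ms": 9.0},
]
print(goodput(reqs, duration_s=10.0))  # 0.3
```

Note the AND in the predicate: a request that streams quickly but starts late still counts as a violation, which is why TTFT alone caps goodput in the table above.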

Nemotron Nano NVFP4 (TP=1, 1 GPU)

| Concurrency | Goodput (req/s) | Output tok/s | Mean TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | P99 TPOT (ms) |
|---|---|---|---|---|---|---|
| 1 | 0.54 | 282 | 210 | 3,775 | 3.14 | 3.15 |
| 32 | 7.21 | 4,067 | 487 | 3,255 | 6.73 | 7.34 |
| 64 | 8.90 | 6,484 | 427 | 1,020 | 8.43 | 9.17 |
| 128 | 8.24 | 9,127 | 682 | 2,034 | 11.35 | 12.47 |
| 256 | 8.87 | 11,950 | 1,589 | 4,094 | 15.53 | 18.35 |
| 512 | 0.55 | 14,051 | 3,341 | 6,469 | 21.00 | 25.42 |

NVFP4 peaks at 8.90 req/s (c=64): lower than FP8's 13.29 because TP=1 means all prefill runs on a single GPU, making TTFT the bottleneck sooner. However, NVFP4 uses half the GPUs (1 vs 2), so it delivers better per-GPU efficiency even under SLO constraints.

For production with strict SLOs, target c=32–64. For batch processing where latency doesn't matter, push to c=512+.

Configuration Profiles

High-Throughput (Batch Processing)

```console
$ vllm serve <model> \
  --tensor-parallel-size <TP> \
  --max-model-len 32768 \
  --max-num-seqs 2048 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code
```

Low-Latency (Interactive)

```console
$ vllm serve <model> \
  --tensor-parallel-size <TP> \
  --max-model-len 8192 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

Multi-Instance (Maximum Node Utilization)

For smaller models that don't need all 8 GPUs, run multiple independent instances:

```console
# Instance 1: GPUs 0-1
$ CUDA_VISIBLE_DEVICES=0,1 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --port 8000 --trust-remote-code &

# Instance 2: GPUs 2-3
$ CUDA_VISIBLE_DEVICES=2,3 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --port 8001 --trust-remote-code &

# Instance 3: GPUs 4-5
$ CUDA_VISIBLE_DEVICES=4,5 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --port 8002 --trust-remote-code &

# Instance 4: GPUs 6-7
$ CUDA_VISIBLE_DEVICES=6,7 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --port 8003 --trust-remote-code &
```

This achieves ~75,000 tok/s aggregate throughput on a single 8-GPU node: 4x the single-instance peak.
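Because the instances share no scheduler, the client (or a load balancer in front) must spread requests across the four ports itself. A minimal round-robin sketch; the `/v1/completions` path assumes the OpenAI-compatible server that `vllm serve` exposes, and you would plug in your own HTTP client:

```python
import itertools

# The four independent instances started above.
PORTS = [8000, 8001, 8002, 8003]
next_port = itertools.cycle(PORTS).__next__

def endpoint() -> str:
    """Round-robin base URL for the next request."""
    return f"http://localhost:{next_port()}/v1/completions"

# Each call rotates to the next instance: 8000, 8001, 8002, 8003, 8000, ...
for _ in range(6):
    print(endpoint())
```

For production, a reverse proxy (nginx, HAProxy, or similar) in front of the four ports achieves the same distribution without client-side logic.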

Chunked Prefill

In vLLM V1 (0.12.0+), chunked prefill is always enabled. Tune the chunk size via --max-num-batched-tokens:

| Value | Effect |
|---|---|
| 2048 (default) | Better inter-token latency, lower TPOT |
| 8192-16384 | Balanced TTFT and ITL |
| 32768+ | Better TTFT, higher throughput, higher ITL |

Multi-Step Scheduling

Reduce CPU-GPU synchronization overhead by batching scheduler steps:

`--num-scheduler-steps 10`

| Value | Benefit |
|---|---|
| 1 | Default, maximum scheduling flexibility |
| 5-10 | Reduced CPU overhead, slightly higher throughput |
| 15-20 | Diminishing returns |

Benchmarking Your Configuration

Use vLLM's built-in benchmark tool to test your specific setup:

```console
# Single concurrency level
$ vllm bench serve \
  --base-url http://localhost:8000 \
  --model <model_id> \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 512 \
  --num-prompts 400 \
  --max-concurrency 64 \
  --save-result \
  --result-filename results.json

# Full sweep (use the included script)
$ ./scripts/bench.sh <model_id> http://localhost:8000
```
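The saved `results.json` can be post-processed to pull out headline metrics. A sketch; the key names below (`output_throughput`, `mean_ttft_ms`, etc.) are assumptions about the benchmark's output schema, so verify them against the JSON your vLLM version actually writes:

```python
import json

# Assumed result-file keys -- check these against your vLLM version's output.
HEADLINE_KEYS = {
    "output_tok_s": "output_throughput",
    "req_s": "request_throughput",
    "mean_ttft_ms": "mean_ttft_ms",
    "p99_ttft_ms": "p99_ttft_ms",
}

def summarize(result: dict) -> dict:
    """Extract headline metrics from a parsed benchmark result."""
    return {ours: result.get(theirs) for ours, theirs in HEADLINE_KEYS.items()}

def summarize_file(path: str) -> dict:
    with open(path) as f:
        return summarize(json.load(f))

# e.g. summarize_file("results.json")
print(summarize({"output_throughput": 8120.0, "mean_ttft_ms": 324.0}))
```

Running this across a concurrency sweep produces exactly the per-model tables shown earlier in this guide.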
