Concurrency Tuning

Updated on 11 March, 2026

Maximize throughput by tuning vLLM for concurrent request loads on NVIDIA HGX B200 GPUs.


Key Parameters

max_num_seqs

Maximum concurrent sequences in a single batch.

| Value | Use Case |
|---|---|
| 128-256 | Memory-constrained, large models (GLM-5, DeepSeek V3.2) |
| 512-1024 | Default, balanced performance |
| 2048-4096 | Maximum throughput, smaller models (Nemotron Nano) |

`--max-num-seqs 2048`

max_num_batched_tokens

Maximum total tokens across all sequences in a batch.

| Value | Trade-off |
|---|---|
| 2048 | Better inter-token latency |
| 8192-16384 | Balanced for interactive use |
| 32768-65536 | Maximum throughput |

`--max-num-batched-tokens 32768`

gpu_memory_utilization

Fraction of GPU memory allocated for model weights and KV cache.

| Value | Use Case |
|---|---|
| 0.45 | Multiple instances per GPU |
| 0.90 | Default, single instance |
| 0.95 | Maximum capacity |

`--gpu-memory-utilization 0.95`
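As a rough mental model of what this flag controls (a back-of-envelope sketch with illustrative numbers, not vLLM's internal memory accounting), the KV cache budget is approximately total GPU memory times the utilization fraction, minus model weights and runtime overhead:

```python
# Back-of-envelope KV cache budget. All figures below are hypothetical
# examples, not measured values from this guide.
def kv_cache_budget_gb(total_mem_gb: float,
                       gpu_memory_utilization: float,
                       weights_gb: float,
                       overhead_gb: float = 2.0) -> float:
    """Approximate memory left for KV cache after weights and overhead."""
    return total_mem_gb * gpu_memory_utilization - weights_gb - overhead_gb

# Hypothetical: a 180 GB GPU serving ~30 GB of FP8 weights at 0.90 utilization.
budget = kv_cache_budget_gb(180, 0.90, 30)
print(f"~{budget:.0f} GB available for KV cache")
```

Raising utilization from 0.90 to 0.95 therefore buys KV cache capacity (more concurrent sequences) at the cost of headroom for allocation spikes.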

Throughput Results (NVIDIA HGX B200 Verified)

All results use input=2048, output=512 tokens, random dataset.

Nemotron Nano 30B (FP8, TP=2, 2 GPUs)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 277 | 362 | 2.91 | 3.14 |
| 8 | 1,881 | 152 | 3.96 | 4.28 |
| 16 | 3,455 | 160 | 4.32 | 5.19 |
| 32 | 3,798 | 206 | 7.86 | 40.72 |
| 64 | 8,120 | 324 | 6.74 | 43.82 |
| 128 | 11,556 | 520 | 8.91 | 45.19 |
| 256 | 15,552 | 1,254 | 11.77 | 48.20 |
| 512 | 18,746 | 2,610 | 15.46 | 47.88 |
| 1024 | 18,829 | 2,563 | 15.46 | 47.75 |

Saturation: ~512 concurrent. Peak: 18,829 tok/s sustained on 2 GPUs.

Nemotron Super 49B (FP8, TP=1, 1 GPU)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 73 | 170 | 13.40 | 14.44 |
| 8 | 570 | 71 | 12.34 | 13.65 |
| 16 | 1,054 | 117 | 11.17 | 13.66 |
| 32 | 2,120 | 172 | 11.57 | 12.09 |
| 64 | 3,816 | 205 | 12.70 | 13.86 |
| 128 | 3,799 | 222 | 12.70 | 13.43 |
| 256 | 1,586 | 5,551 | 20.56 | 192.31 |
| 512 | 3,775 | 271 | 12.69 | 13.66 |
| 1024 | 1,587 | 5,544 | 20.56 | 189.99 |

Saturation: ~64 concurrent. Peak: 3,816 tok/s on 1 GPU. Throughput oscillates at c=256/1024 due to TP=1 batch scheduling effects.

MiniMax M2.5 (FP8, TP=4, 4 GPUs)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 87 | 84 | 11.40 | 12.75 |
| 8 | 581 | 217 | 13.37 | 13.72 |
| 16 | 1,057 | 77 | 15.01 | 15.85 |
| 32 | 1,636 | 91 | 18.82 | 22.04 |
| 64 | 2,591 | 143 | 22.71 | 27.37 |
| 128 | 3,943 | 220 | 27.90 | 31.98 |
| 256 | 5,945 | 393 | 34.05 | 41.89 |
| 512 | 8,822 | 686 | 43.79 | 52.39 |
| 1024 | 8,838 | 687 | 43.73 | 52.25 |

Saturation: ~512 concurrent. Peak: 8,838 tok/s sustained on 4 GPUs.

GLM-5 744B (FP8, TP=8, 8 GPUs)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 54 | 183 | 18.31 | 18.45 |
| 8 | 317 | 1,249 | 22.85 | 22.20 |
| 16 | 606 | 788 | 24.89 | 24.94 |
| 32 | 884 | 1,341 | 33.56 | 31.86 |
| 64 | 1,456 | 2,031 | 39.94 | 474.93 |
| 128 | 2,132 | 3,558 | 52.93 | 569.76 |
| 256 | 2,071 | 6,207 | 111.12 | 580.49 |
| 512 | 1,955 | 57,202 | 139.19 | 584.22 |
| 1024 | 1,944 | 175,241 | 143.10 | 584.03 |

Saturation: ~128 concurrent. Peak: 2,132 tok/s sustained on 8 GPUs.

DeepSeek V3.2 685B (FP8, TP=8, 8 GPUs)

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 106 | 92 | 9.28 | 9.46 |
| 8 | 503 | 1,311 | 13.36 | 12.90 |
| 16 | 930 | 618 | 16.00 | 15.47 |
| 32 | 1,382 | 931 | 21.34 | 22.32 |
| 64 | 2,281 | 843 | 26.39 | 143.82 |
| 128 | 3,545 | 1,263 | 33.52 | 222.37 |
| 256 | 4,010 | 2,607 | 58.45 | 232.63 |
| 512 | 4,321 | 7,397 | 98.86 | 241.35 |
| 1024 | 4,370 | 58,298 | 104.21 | 243.37 |

Saturation: ~512 concurrent. Peak: 4,370 tok/s sustained on 8 GPUs.

Peak Throughput and Saturation

| Model | GPUs | Peak tok/s | tok/s/GPU | Saturation Point |
|---|---|---|---|---|
| Nemotron Nano 30B | 2 | 18,829 | 9,415 | ~512 concurrent |
| Nemotron Super 49B | 1 | 3,816 | 3,816 | ~64 concurrent |
| MiniMax M2.5 229B | 4 | 8,838 | 2,210 | ~512 concurrent |
| GLM-5 744B | 8 | 2,132 | 267 | ~128 concurrent |
| DeepSeek V3.2 685B | 8 | 4,370 | 546 | ~512 concurrent |

Key Findings

  • Active parameters predict per-GPU throughput: Nemotron Nano (3B active) achieves 9,415 tok/s/GPU; MiniMax M2.5 (10B active) achieves 2,210 tok/s/GPU; DeepSeek V3.2 (37B active) achieves 546 tok/s/GPU. Fewer active parameters = higher per-GPU throughput.
  • Saturation varies by KV cache pressure: Models with efficient KV caching (MLA, Mamba) saturate at ~512 concurrent. GLM-5 saturates earlier at ~128 due to higher KV cache consumption per token.
  • MoE efficiency: Four of five models leverage MoE to activate only a fraction of total parameters per token, enabling high throughput despite large total parameter counts. Nemotron Super 49B is the exception: a dense transformer where all 49B parameters are active on every token.
  • TTFT grows with concurrency: roughly linear below saturation, then sharply worse beyond it as prefill requests queue behind active decode operations (GLM-5's TTFT jumps from 6.2 s at c=256 to 175 s at c=1024).
  • DeepSeek V3.2 scales best among TP=8 models: MLA's compressed KV cache provides 1.15M tokens capacity vs GLM-5's 691K, enabling 2x the sustained throughput at high concurrency.
  • Zero failures: 100% success rate at all concurrency levels up to 1,024 for all five models.
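The per-GPU column in the summary table is simple division of peak throughput by GPU count; a quick sketch reproducing those figures:

```python
# Reproduce the tok/s/GPU column from the peak-throughput table above.
peaks = {
    "Nemotron Nano 30B":  (18829, 2),
    "Nemotron Super 49B": (3816, 1),
    "MiniMax M2.5 229B":  (8838, 4),
    "GLM-5 744B":         (2132, 8),
    "DeepSeek V3.2 685B": (4370, 8),
}

# Round half-up to match the table's figures.
per_gpu = {m: int(tok / gpus + 0.5) for m, (tok, gpus) in peaks.items()}
for model, rate in sorted(per_gpu.items(), key=lambda kv: -kv[1]):
    print(f"{model:<20} {rate:>6} tok/s/GPU")
```

Sorting by this metric puts the models in order of active parameter count, which is the first finding above.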

TTFT vs TPOT Trade-Offs

Interactive applications should optimize for TTFT (time to first token): the user-perceived latency before streaming begins. Batch processing should optimize for throughput (tok/s).

| Use Case | Target TTFT | Recommended Concurrency | Expected Throughput |
|---|---|---|---|
| Real-time chat | < 200 ms | 1-16 | Model dependent |
| Interactive API | < 500 ms | 16-64 | 50-70% of peak |
| Batch processing | Don't care | 256-512 | ~100% of peak |
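A TTFT threshold table like this can be derived mechanically from sweep data: keep only concurrency levels whose measured TTFT meets the SLO, then take the highest. A sketch using the Nemotron Nano sweep numbers from above:

```python
# (concurrency, mean TTFT ms, output tok/s) from the Nemotron Nano sweep above.
SWEEP = [
    (1, 362, 277), (8, 152, 1881), (16, 160, 3455), (32, 206, 3798),
    (64, 324, 8120), (128, 520, 11556), (256, 1254, 15552),
    (512, 2610, 18746), (1024, 2563, 18829),
]

def max_concurrency_for_ttft(sweep, ttft_slo_ms):
    """Highest measured concurrency whose mean TTFT meets the SLO."""
    ok = [(c, tput) for c, ttft, tput in sweep if ttft < ttft_slo_ms]
    return max(ok) if ok else None

print(max_concurrency_for_ttft(SWEEP, 500))   # (64, 8120)
```

Because TTFT is not strictly monotonic in concurrency (compare c=512 and c=1024 above), filtering all levels is more robust than stopping at the first violation.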

Nemotron Nano TTFT Thresholds

| TTFT Target | Max Concurrency | Throughput Achieved | % of Peak |
|---|---|---|---|
| < 200 ms | 16 | 3,455 tok/s | 18% |
| < 350 ms | 64 | 8,120 tok/s | 43% |
| < 600 ms | 128 | 11,556 tok/s | 61% |
| < 1,500 ms | 256 | 15,552 tok/s | 83% |
| < 3,000 ms | 512 | 18,746 tok/s | 100% |
| Unconstrained | 1024+ | 18,829 tok/s | 100% |

MiniMax M2.5 TTFT Thresholds

| TTFT Target | Max Concurrency | Throughput Achieved | % of Peak |
|---|---|---|---|
| < 100 ms | 32 | 1,636 tok/s | 19% |
| < 250 ms | 128 | 3,943 tok/s | 45% |
| < 500 ms | 256 | 5,945 tok/s | 67% |
| < 1,000 ms | 1024 | 8,838 tok/s | 100% |

Goodput Benchmarks (Nemotron Nano FP8)

Goodput measures requests per second that meet all latency SLOs simultaneously. Tested with TTFT < 500ms and TPOT < 50ms.

| Concurrency | Goodput (req/s) | Output tok/s | Mean TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | P99 TPOT (ms) |
|---|---|---|---|---|---|---|
| 1 | 0.54 | 281 | 363 | 7,998 | 2.85 | 2.91 |
| 32 | 6.10 | 3,331 | 688 | 8,637 | 8.10 | 23.80 |
| 64 | 13.29 | 8,027 | 349 | 841 | 6.79 | 7.35 |
| 128 | 6.20 | 11,343 | 727 | 1,621 | 8.72 | 9.83 |
| 256 | 3.65 | 15,237 | 1,377 | 3,254 | 11.78 | 14.23 |
| 512 | 2.08 | 18,535 | 2,614 | 5,050 | 15.70 | 19.27 |

Peak goodput at c=64 (13.29 req/s). Beyond this, TTFT exceeds 500ms and most requests violate the SLO. TPOT stays well under the 50ms SLO at all concurrency levels: TTFT is the binding constraint.
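Goodput follows directly from per-request results: count only requests that satisfy every SLO simultaneously, then divide by the wall-clock duration. A minimal sketch (the record field names here are illustrative, not the benchmark tool's schema):

```python
def goodput(requests, duration_s, ttft_slo_ms=500.0, tpot_slo_ms=50.0):
    """Requests per second that met BOTH the TTFT and TPOT SLOs."""
    good = sum(1 for r in requests
               if r["ttft_ms"] < ttft_slo_ms and r["tpot_ms"] < tpot_slo_ms)
    return good / duration_s

# Toy data: 3 of 4 requests meet both SLOs over a 10 s window -> 0.3 req/s.
reqs = [
    {"ttft_ms": 349, "tpot_ms": 6.8},
    {"ttft_ms": 841, "tpot_ms": 7.4},   # violates the 500 ms TTFT SLO
    {"ttft_ms": 120, "tpot_ms": 4.1},
    {"ttft_ms": 420, "tpot_ms": 9.0},
]
print(goodput(reqs, duration_s=10.0))  # 0.3
```

Note the AND in the predicate: a request that streams quickly but starts late still counts as a violation, which is why TTFT alone caps goodput in the table above.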

Nemotron Nano NVFP4 (TP=1, 1 GPU)

| Concurrency | Goodput (req/s) | Output tok/s | Mean TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | P99 TPOT (ms) |
|---|---|---|---|---|---|---|
| 1 | 0.54 | 282 | 210 | 3,775 | 3.14 | 3.15 |
| 32 | 7.21 | 4,067 | 487 | 3,255 | 6.73 | 7.34 |
| 64 | 8.90 | 6,484 | 427 | 1,020 | 8.43 | 9.17 |
| 128 | 8.24 | 9,127 | 682 | 2,034 | 11.35 | 12.47 |
| 256 | 8.87 | 11,950 | 1,589 | 4,094 | 15.53 | 18.35 |
| 512 | 0.55 | 14,051 | 3,341 | 6,469 | 21.00 | 25.42 |

NVFP4 peaks at 8.90 req/s (c=64): lower than FP8's 13.29 because TP=1 means all prefill runs on a single GPU, making TTFT the bottleneck sooner. However, NVFP4 uses half the GPUs (1 vs 2), so it delivers better per-GPU efficiency even under SLO constraints.

For production with strict SLOs, target c=32–64. For batch processing where latency doesn't matter, push to c=512+.

Configuration Profiles

High-Throughput (Batch Processing)

```console
$ vllm serve <model> \
  --tensor-parallel-size <TP> \
  --max-model-len 32768 \
  --max-num-seqs 2048 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code
```

Low-Latency (Interactive)

```console
$ vllm serve <model> \
  --tensor-parallel-size <TP> \
  --max-model-len 8192 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

Multi-Instance (Maximum Node Utilization)

For smaller models that don't need all 8 GPUs, run multiple independent instances:

```console
# Instance 1: GPUs 0-1
$ CUDA_VISIBLE_DEVICES=0,1 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --port 8000 --trust-remote-code &

# Instance 2: GPUs 2-3
$ CUDA_VISIBLE_DEVICES=2,3 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --port 8001 --trust-remote-code &

# Instance 3: GPUs 4-5
$ CUDA_VISIBLE_DEVICES=4,5 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --port 8002 --trust-remote-code &

# Instance 4: GPUs 6-7
$ CUDA_VISIBLE_DEVICES=6,7 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --port 8003 --trust-remote-code &
```

This achieves ~75,000 tok/s aggregate throughput on a single 8-GPU node: 4x the single-instance peak.
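Because the instances share no scheduler, the client (or a load balancer in front) must spread requests across the four ports itself. A minimal round-robin sketch; the `/v1/completions` path assumes the OpenAI-compatible server that `vllm serve` exposes, and you would plug in your own HTTP client:

```python
import itertools

# The four independent instances started above.
PORTS = [8000, 8001, 8002, 8003]
next_port = itertools.cycle(PORTS).__next__

def endpoint() -> str:
    """Round-robin base URL for the next request."""
    return f"http://localhost:{next_port()}/v1/completions"

# Each call rotates to the next instance: 8000, 8001, 8002, 8003, 8000, ...
for _ in range(6):
    print(endpoint())
```

For production, a reverse proxy (nginx, HAProxy, or similar) in front of the four ports achieves the same distribution without client-side logic.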

Chunked Prefill

In vLLM V1 (0.12.0+), chunked prefill is always enabled. Tune the chunk size via --max-num-batched-tokens:

| Value | Effect |
|---|---|
| 2048 (default) | Better inter-token latency, lower TPOT |
| 8192-16384 | Balanced TTFT and ITL |
| 32768+ | Better TTFT, higher throughput, higher ITL |

Multi-Step Scheduling

Reduce CPU-GPU synchronization overhead by batching scheduler steps:

`--num-scheduler-steps 10`

| Value | Benefit |
|---|---|
| 1 | Default, maximum scheduling flexibility |
| 5-10 | Reduced CPU overhead, slightly higher throughput |
| 15-20 | Diminishing returns |

Benchmarking Your Configuration

Use vLLM's built-in benchmark tool to test your specific setup:

```console
# Single concurrency level
$ vllm bench serve \
  --base-url http://localhost:8000 \
  --model <model_id> \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 512 \
  --num-prompts 400 \
  --max-concurrency 64 \
  --save-result \
  --result-filename results.json

# Full sweep (use the included script)
$ ./scripts/bench.sh <model_id> http://localhost:8000
```
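The saved `results.json` can be post-processed to pull out headline metrics. A sketch; the key names below (`output_throughput`, `mean_ttft_ms`, etc.) are assumptions about the benchmark's output schema, so verify them against the JSON your vLLM version actually writes:

```python
import json

# Assumed result-file keys -- check these against your vLLM version's output.
HEADLINE_KEYS = {
    "output_tok_s": "output_throughput",
    "req_s": "request_throughput",
    "mean_ttft_ms": "mean_ttft_ms",
    "p99_ttft_ms": "p99_ttft_ms",
}

def summarize(result: dict) -> dict:
    """Extract headline metrics from a parsed benchmark result."""
    return {ours: result.get(theirs) for ours, theirs in HEADLINE_KEYS.items()}

def summarize_file(path: str) -> dict:
    with open(path) as f:
        return summarize(json.load(f))

# e.g. summarize_file("results.json")
print(summarize({"output_throughput": 8120.0, "mean_ttft_ms": 324.0}))
```

Running this across a concurrency sweep produces exactly the per-model tables shown earlier in this guide.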
