Maximize throughput by tuning vLLM for concurrent request loads on NVIDIA HGX B200 GPUs.
`--max-num-seqs`: Maximum number of concurrent sequences in a single batch.
| Value | Use Case |
|---|---|
| 128-256 | Memory-constrained, large models (GLM-5, DeepSeek V3.2) |
| 512-1024 | Default, balanced performance |
| 2048-4096 | Maximum throughput, smaller models (Nemotron Nano) |
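A quick way to sanity-check a `--max-num-seqs` choice is KV-cache arithmetic: bytes per token, times context length, times concurrent sequences must fit in the cache budget. A minimal sketch; the model dimensions (48 layers, 8 KV heads, head_dim 128) and the 100 GiB budget are illustrative placeholders, not taken from any model benchmarked here:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=1):
    """Per-token KV-cache footprint; the 2x covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical GQA model: 48 layers, 8 KV heads, head_dim 128, FP8 (1-byte) cache
per_token = kv_cache_bytes_per_token(48, 8, 128)
per_seq = per_token * 4096              # one fully-filled 4K-token sequence
budget = 100 * 1024**3                  # assumed 100 GiB of HBM left for KV cache
print(budget // per_seq)                # max concurrent full-length sequences
```

If the result lands well below your target `--max-num-seqs`, vLLM will simply queue the excess, so the flag becomes an admission cap rather than a real parallelism setting.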
Example: `--max-num-seqs 2048`

`--max-num-batched-tokens`: Maximum total tokens across all sequences in a batch.
| Value | Trade-off |
|---|---|
| 2048 | Better inter-token latency |
| 8192-16384 | Balanced for interactive use |
| 32768-65536 | Maximum throughput |
Example: `--max-num-batched-tokens 32768`

`--gpu-memory-utilization`: Fraction of GPU memory allocated to model weights and KV cache.
| Value | Use Case |
|---|---|
| 0.45 | Multiple instances per GPU |
| 0.90 | Default, single instance |
| 0.95 | Maximum capacity |
Example: `--gpu-memory-utilization 0.95`

All benchmark results below use input=2048 tokens, output=512 tokens, and vLLM's random dataset.
**Nemotron Nano 30B (2 GPUs)**

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 277 | 362 | 2.91 | 3.14 |
| 8 | 1,881 | 152 | 3.96 | 4.28 |
| 16 | 3,455 | 160 | 4.32 | 5.19 |
| 32 | 3,798 | 206 | 7.86 | 40.72 |
| 64 | 8,120 | 324 | 6.74 | 43.82 |
| 128 | 11,556 | 520 | 8.91 | 45.19 |
| 256 | 15,552 | 1,254 | 11.77 | 48.20 |
| 512 | 18,746 | 2,610 | 15.46 | 47.88 |
| 1024 | 18,829 | 2,563 | 15.46 | 47.75 |
Saturation: ~512 concurrent. Peak: 18,829 tok/s sustained on 2 GPUs.
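The saturation points quoted throughout can be recovered mechanically from a sweep: the first concurrency level beyond which throughput stops growing meaningfully. A sketch; the 5% gain threshold is an arbitrary choice, not something the benchmark harness defines:

```python
def saturation_point(sweep, min_gain=0.05):
    """sweep: (concurrency, output_tok_s) pairs, ascending concurrency.
    Returns the first concurrency after which the next level gains < min_gain."""
    for (c, tput), (_, next_tput) in zip(sweep, sweep[1:]):
        if next_tput < tput * (1 + min_gain):
            return c
    return sweep[-1][0]

# Nemotron Nano 30B sweep from the table above
sweep = [(1, 277), (8, 1881), (16, 3455), (32, 3798), (64, 8120),
         (128, 11556), (256, 15552), (512, 18746), (1024, 18829)]
print(saturation_point(sweep))  # 512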
**Nemotron Super 49B (1 GPU)**

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 73 | 170 | 13.40 | 14.44 |
| 8 | 570 | 71 | 12.34 | 13.65 |
| 16 | 1,054 | 117 | 11.17 | 13.66 |
| 32 | 2,120 | 172 | 11.57 | 12.09 |
| 64 | 3,816 | 205 | 12.70 | 13.86 |
| 128 | 3,799 | 222 | 12.70 | 13.43 |
| 256 | 1,586 | 5,551 | 20.56 | 192.31 |
| 512 | 3,775 | 271 | 12.69 | 13.66 |
| 1024 | 1,587 | 5,544 | 20.56 | 189.99 |
Saturation: ~64 concurrent. Peak: 3,816 tok/s on 1 GPU. Throughput collapses at c=256 and c=1024 but recovers at c=512; this oscillation is attributed to batch-scheduling effects at TP=1.
**MiniMax M2.5 229B (4 GPUs)**

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 87 | 84 | 11.40 | 12.75 |
| 8 | 581 | 217 | 13.37 | 13.72 |
| 16 | 1,057 | 77 | 15.01 | 15.85 |
| 32 | 1,636 | 91 | 18.82 | 22.04 |
| 64 | 2,591 | 143 | 22.71 | 27.37 |
| 128 | 3,943 | 220 | 27.90 | 31.98 |
| 256 | 5,945 | 393 | 34.05 | 41.89 |
| 512 | 8,822 | 686 | 43.79 | 52.39 |
| 1024 | 8,838 | 687 | 43.73 | 52.25 |
Saturation: ~512 concurrent. Peak: 8,838 tok/s sustained on 4 GPUs.
**GLM-5 744B (8 GPUs)**

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 54 | 183 | 18.31 | 18.45 |
| 8 | 317 | 1,249 | 22.85 | 22.20 |
| 16 | 606 | 788 | 24.89 | 24.94 |
| 32 | 884 | 1,341 | 33.56 | 31.86 |
| 64 | 1,456 | 2,031 | 39.94 | 474.93 |
| 128 | 2,132 | 3,558 | 52.93 | 569.76 |
| 256 | 2,071 | 6,207 | 111.12 | 580.49 |
| 512 | 1,955 | 57,202 | 139.19 | 584.22 |
| 1024 | 1,944 | 175,241 | 143.10 | 584.03 |
Saturation: ~128 concurrent. Peak: 2,132 tok/s sustained on 8 GPUs.
**DeepSeek V3.2 685B (8 GPUs)**

| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 106 | 92 | 9.28 | 9.46 |
| 8 | 503 | 1,311 | 13.36 | 12.90 |
| 16 | 930 | 618 | 16.00 | 15.47 |
| 32 | 1,382 | 931 | 21.34 | 22.32 |
| 64 | 2,281 | 843 | 26.39 | 143.82 |
| 128 | 3,545 | 1,263 | 33.52 | 222.37 |
| 256 | 4,010 | 2,607 | 58.45 | 232.63 |
| 512 | 4,321 | 7,397 | 98.86 | 241.35 |
| 1024 | 4,370 | 58,298 | 104.21 | 243.37 |
Saturation: ~512 concurrent. Peak: 4,370 tok/s sustained on 8 GPUs.
| Model | GPUs | Peak tok/s | tok/s/GPU | Saturation Point |
|---|---|---|---|---|
| Nemotron Nano 30B | 2 | 18,829 | 9,415 | ~512 concurrent |
| Nemotron Super 49B | 1 | 3,816 | 3,816 | ~64 concurrent |
| MiniMax M2.5 229B | 4 | 8,838 | 2,210 | ~512 concurrent |
| GLM-5 744B | 8 | 2,132 | 267 | ~128 concurrent |
| DeepSeek V3.2 685B | 8 | 4,370 | 546 | ~512 concurrent |
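The tok/s/GPU column is peak throughput divided by GPU count; recomputing it from the table makes it easy to rank models by per-GPU efficiency as new rows are added:

```python
# GPU counts and peak throughputs copied from the summary table above
results = {
    "Nemotron Nano 30B":  (2, 18829),
    "Nemotron Super 49B": (1, 3816),
    "MiniMax M2.5 229B":  (4, 8838),
    "GLM-5 744B":         (8, 2132),
    "DeepSeek V3.2 685B": (8, 4370),
}
per_gpu = {model: peak / gpus for model, (gpus, peak) in results.items()}
for model, eff in sorted(per_gpu.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {eff:,.0f} tok/s/GPU")
```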
Interactive applications should optimize for TTFT (time to first token): the user-perceived latency before streaming begins. Batch processing should optimize for throughput (tok/s).
| Use Case | Target TTFT | Recommended Concurrency | Expected Throughput |
|---|---|---|---|
| Real-time chat | < 200 ms | 1-16 | Model dependent |
| Interactive API | < 500 ms | 16-64 | 50-70% of peak |
| Batch processing | Not latency-sensitive | 256-512 | ~100% of peak |
**Nemotron Nano 30B (2 GPUs)**

| TTFT Target | Max Concurrency | Throughput Achieved | % of Peak |
|---|---|---|---|
| < 200 ms | 16 | 3,455 tok/s | 18% |
| < 350 ms | 64 | 8,120 tok/s | 43% |
| < 600 ms | 128 | 11,556 tok/s | 61% |
| < 1,500 ms | 256 | 15,552 tok/s | 83% |
| < 3,000 ms | 512 | 18,746 tok/s | 100% |
| Unconstrained | 1024+ | 18,829 tok/s | 100% |
**MiniMax M2.5 229B (4 GPUs)**

| TTFT Target | Max Concurrency | Throughput Achieved | % of Peak |
|---|---|---|---|
| < 100 ms | 32 | 1,636 tok/s | 19% |
| < 250 ms | 128 | 3,943 tok/s | 45% |
| < 500 ms | 256 | 5,945 tok/s | 67% |
| < 1,000 ms | 1024 | 8,838 tok/s | 100% |
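Picking a concurrency cap for a TTFT budget is a lookup over the sweep: take the highest concurrency whose measured TTFT stays under the target. A sketch using the Nemotron Nano numbers from its results table above:

```python
def max_concurrency_for_ttft(sweep, ttft_budget_ms):
    """sweep: (concurrency, output_tok_s, ttft_ms) triples.
    Returns (concurrency, tok_s) for the highest concurrency under budget."""
    ok = [(c, tput) for c, tput, ttft in sweep if ttft < ttft_budget_ms]
    return max(ok) if ok else None

# Nemotron Nano 30B: (concurrency, output tok/s, TTFT ms)
sweep = [(1, 277, 362), (8, 1881, 152), (16, 3455, 160), (32, 3798, 206),
         (64, 8120, 324), (128, 11556, 520), (256, 15552, 1254),
         (512, 18746, 2610), (1024, 18829, 2563)]
print(max_concurrency_for_ttft(sweep, 350))  # (64, 8120)
```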
Goodput measures requests per second that meet all latency SLOs simultaneously. Tested with TTFT < 500ms and TPOT < 50ms.
**FP8 (TP=2)**

| Concurrency | Goodput (req/s) | Output tok/s | Mean TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | P99 TPOT (ms) |
|---|---|---|---|---|---|---|
| 1 | 0.54 | 281 | 363 | 7,998 | 2.85 | 2.91 |
| 32 | 6.10 | 3,331 | 688 | 8,637 | 8.10 | 23.80 |
| 64 | 13.29 | 8,027 | 349 | 841 | 6.79 | 7.35 |
| 128 | 6.20 | 11,343 | 727 | 1,621 | 8.72 | 9.83 |
| 256 | 3.65 | 15,237 | 1,377 | 3,254 | 11.78 | 14.23 |
| 512 | 2.08 | 18,535 | 2,614 | 5,050 | 15.70 | 19.27 |
Peak goodput at c=64 (13.29 req/s). Beyond this, TTFT exceeds 500ms and most requests violate the SLO. TPOT stays well under the 50ms SLO at all concurrency levels: TTFT is the binding constraint.
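Goodput as defined above can be computed directly from per-request metrics: a request counts only if it meets every SLO simultaneously. A minimal sketch, assuming per-request TTFT and TPOT have already been extracted from the benchmark output:

```python
def goodput(requests, duration_s, ttft_slo_ms=500.0, tpot_slo_ms=50.0):
    """requests: dicts with per-request 'ttft_ms' and 'tpot_ms'.
    Returns SLO-compliant requests completed per second."""
    good = sum(1 for r in requests
               if r["ttft_ms"] < ttft_slo_ms and r["tpot_ms"] < tpot_slo_ms)
    return good / duration_s

reqs = [{"ttft_ms": 320, "tpot_ms": 7.1},   # meets both SLOs
        {"ttft_ms": 610, "tpot_ms": 6.9},   # violates TTFT SLO
        {"ttft_ms": 410, "tpot_ms": 55.0}]  # violates TPOT SLO
print(goodput(reqs, duration_s=2.0))  # 0.5
```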
**NVFP4 (TP=1)**

| Concurrency | Goodput (req/s) | Output tok/s | Mean TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | P99 TPOT (ms) |
|---|---|---|---|---|---|---|
| 1 | 0.54 | 282 | 210 | 3,775 | 3.14 | 3.15 |
| 32 | 7.21 | 4,067 | 487 | 3,255 | 6.73 | 7.34 |
| 64 | 8.90 | 6,484 | 427 | 1,020 | 8.43 | 9.17 |
| 128 | 8.24 | 9,127 | 682 | 2,034 | 11.35 | 12.47 |
| 256 | 8.87 | 11,950 | 1,589 | 4,094 | 15.53 | 18.35 |
| 512 | 0.55 | 14,051 | 3,341 | 6,469 | 21.00 | 25.42 |
NVFP4 peaks at 8.90 req/s (c=64): lower than FP8's 13.29 because TP=1 means all prefill runs on a single GPU, making TTFT the bottleneck sooner. However, NVFP4 uses half the GPUs (1 vs 2), so it delivers better per-GPU efficiency even under SLO constraints.
For production with strict SLOs, target c=32–64. For batch processing where latency doesn't matter, push to c=512+.
Throughput-optimized:

$ vllm serve <model> \
--tensor-parallel-size <TP> \
--max-model-len 32768 \
--max-num-seqs 2048 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.95 \
--trust-remote-code
Latency-optimized:

$ vllm serve <model> \
--tensor-parallel-size <TP> \
--max-model-len 8192 \
--max-num-seqs 512 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.90 \
--trust-remote-code
For smaller models that don't need all 8 GPUs, run multiple independent instances:
# Instance 1: GPUs 0-1
$ CUDA_VISIBLE_DEVICES=0,1 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--tensor-parallel-size 2 --port 8000 --trust-remote-code &
# Instance 2: GPUs 2-3
$ CUDA_VISIBLE_DEVICES=2,3 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--tensor-parallel-size 2 --port 8001 --trust-remote-code &
# Instance 3: GPUs 4-5
$ CUDA_VISIBLE_DEVICES=4,5 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--tensor-parallel-size 2 --port 8002 --trust-remote-code &
# Instance 4: GPUs 6-7
$ CUDA_VISIBLE_DEVICES=6,7 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--tensor-parallel-size 2 --port 8003 --trust-remote-code &
This achieves ~75,000 tok/s aggregate throughput on a single 8-GPU node: 4x the single-instance peak.
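Nothing load-balances across the four instances, so clients must spread requests themselves. A minimal client-side round-robin sketch; the endpoint list and `/v1/completions` path mirror the commands above, while a production setup would typically put nginx or another real load balancer in front:

```python
import itertools

# Ports match the four instances launched above
ENDPOINTS = [f"http://localhost:{port}/v1/completions"
             for port in (8000, 8001, 8002, 8003)]
_rr = itertools.cycle(ENDPOINTS)

def next_endpoint():
    """Return the next instance URL in round-robin order."""
    return next(_rr)

targets = [next_endpoint() for _ in range(5)]
print(targets[0] == targets[4])  # True: the cycle wraps after four picks
```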
In vLLM V1 (0.12.0+), chunked prefill is always enabled. Tune the chunk size via `--max-num-batched-tokens`:
| Value | Effect |
|---|---|
| 2048 (default) | Better inter-token latency, lower TPOT |
| 8192-16384 | Balanced TTFT and ITL |
| 32768+ | Better TTFT, higher throughput, higher ITL |
Reduce CPU-GPU synchronization overhead by batching scheduler steps:
Example: `--num-scheduler-steps 10`

| Value | Benefit |
|---|---|
| 1 | Default, maximum scheduling flexibility |
| 5-10 | Reduced CPU overhead, slightly higher throughput |
| 15-20 | Diminishing returns |
Use vLLM's built-in benchmark tool to test your specific setup:
# Single concurrency level
$ vllm bench serve \
--base-url http://localhost:8000 \
--model <model_id> \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 512 \
--num-prompts 400 \
--max-concurrency 64 \
--save-result \
--result-filename results.json
# Full sweep (use the included script)
$ ./scripts/bench.sh <model_id> http://localhost:8000
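The saved JSON can then be reduced to the headline metrics. A sketch; the key names (`output_throughput`, `mean_ttft_ms`, etc.) are assumed from recent `vllm bench serve` output and can change between releases, so verify them against a results file your version actually produced:

```python
import json

# Field names assumed from `vllm bench serve --save-result` output;
# verify them against a results.json from your vLLM version.
FIELDS = ("output_throughput", "mean_ttft_ms", "p99_ttft_ms", "mean_tpot_ms")

def summarize(result):
    """result: parsed results.json dict -> headline metrics (None if absent)."""
    return {k: result.get(k) for k in FIELDS}

def summarize_file(path):
    with open(path) as f:
        return summarize(json.load(f))
```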