How we measure LLM inference performance on NVIDIA HGX B200 GPUs.
| Parameter | Value | Rationale |
|---|---|---|
| Input tokens | 2,048 | Represents a typical prompt with context |
| Output tokens | 512 | Standard generation length |
| Dataset | Random (synthetic) | Eliminates tokenizer variance; consistent across models |
| Prompts per run | 50 (c=1), 400 (c>1) | Enough for stable averages at each concurrency level (see note) |
| Concurrency levels | 1, 8, 16, 32, 64, 128, 256, 512, 1024 | Sweep from single-user to saturation |
| Serving framework | vLLM 0.12.0 / 0.16.0 | See note below |
| GPU utilization | 0.90 | 90% of VRAM allocated for model + KV cache |
**Note on vLLM versions:** two vLLM versions were used because of a kernel incompatibility (`n_group=0` is incompatible with the fused DeepSeekV3 routing kernel). Both versions use the same `vllm bench serve` interface with identical parameters.
**Note on prompt counts:** prompt counts are set per model in `bench_all.sh`. DeepSeek V3.2 and GLM-5 were benchmarked with 4× the concurrency in prompts per level (e.g., 4 at c=1, 512 at c=128), while Nemotron Super 49B used ~50 prompts at all concurrency levels. Throughput metrics (tok/s) are rate-based and remain valid across prompt counts; however, lower prompt counts at c=1 provide less statistical averaging.
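Taken together, the parameters above fix the token volume processed at each sweep point. A quick sanity check (a minimal sketch using the table's values; the helper name is ours, not part of the benchmark tooling):

```python
# Token volume per concurrency level, from the benchmark parameters above.
INPUT_LEN = 2048    # --random-input-len
OUTPUT_LEN = 512    # --random-output-len

def tokens_per_level(num_prompts: int) -> tuple[int, int]:
    """Return (prefill_tokens, decode_tokens) processed at one sweep point."""
    return num_prompts * INPUT_LEN, num_prompts * OUTPUT_LEN

prefill, decode = tokens_per_level(400)   # c > 1 runs; c = 1 runs use 50 prompts
print(f"prefill: {prefill:,} tok, decode: {decode:,} tok")
```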
| Metric | Unit | What It Measures |
|---|---|---|
| Output throughput | tok/s | Total output tokens generated per second across all concurrent requests |
| Peak throughput | tok/s | Maximum instantaneous throughput observed during the run |
| TTFT | ms | Time to First Token: latency from request submission to first token received |
| TPOT | ms | Time Per Output Token: average time between consecutive output tokens |
| ITL p99 | ms | Inter-Token Latency at 99th percentile: worst-case token-to-token delay |
| Saturation point | concurrency | The concurrency level where throughput stops increasing meaningfully |
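The per-request metrics above can all be derived from token arrival timestamps. A minimal sketch of the arithmetic (the timestamp capture itself is assumed; vLLM's bench tool records these internally, and this helper is ours):

```python
import statistics

def request_metrics(t_submit: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPOT, and ITL p99 (all in ms) from token arrival times (s)."""
    ttft = (token_times[0] - t_submit) * 1000
    # Inter-token gaps: the delays between consecutive output tokens
    gaps = [(b - a) * 1000 for a, b in zip(token_times, token_times[1:])]
    tpot = statistics.mean(gaps)
    # p99 of the gaps approximates the worst-case token-to-token delay
    itl_p99 = statistics.quantiles(gaps, n=100)[98] if len(gaps) >= 2 else gaps[0]
    return {"ttft_ms": ttft, "tpot_ms": tpot, "itl_p99_ms": itl_p99}

# Example: 4 tokens arriving 50 ms apart after a 200 ms prefill
m = request_metrics(0.0, [0.200, 0.250, 0.300, 0.350])
```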
All benchmarks use vLLM's built-in benchmarking tool:

```bash
$ vllm bench serve \
    --base-url http://localhost:8000 \
    --model <model_id> \
    --dataset-name random \
    --random-input-len 2048 \
    --random-output-len 512 \
    --num-prompts 400 \
    --max-concurrency <N> \
    --save-result \
    --result-filename results.json
```
Results are saved as JSON for each concurrency level and aggregated into the tables shown in this cookbook.
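The aggregation step can be sketched as a short script. Note that the JSON field names used here (`max_concurrency`, `output_throughput`, `mean_ttft_ms`, `mean_tpot_ms`) and the `results_c*.json` naming pattern are assumptions; inspect the files your vLLM version actually emits before relying on them:

```python
import glob
import json

def to_markdown(results: list[dict]) -> str:
    """Render per-concurrency result dicts as a markdown table.

    Field names are assumptions about the vLLM result JSON, not a spec.
    """
    rows = sorted((r["max_concurrency"], r["output_throughput"],
                   r["mean_ttft_ms"], r["mean_tpot_ms"]) for r in results)
    lines = ["| Concurrency | tok/s | TTFT (ms) | TPOT (ms) |",
             "|---|---|---|---|"]
    lines += [f"| {c} | {t:.0f} | {f:.1f} | {p:.1f} |" for c, t, f, p in rows]
    return "\n".join(lines)

# One result file per concurrency level (hypothetical naming scheme)
results = [json.loads(open(p).read()) for p in sorted(glob.glob("results_c*.json"))]
print(to_markdown(results))
```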
**Throughput vs latency tradeoff:** Raising concurrency increases aggregate throughput until the GPU saturates. Beyond that point, additional concurrent requests only increase latency with no throughput gain. The "saturation point" marks where this transition occurs.
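One way to operationalize the saturation point is a simple gain threshold over the sweep results. A sketch, where both the 5% cutoff and the sample numbers are illustrative, not measured values from this cookbook:

```python
def saturation_point(sweep: list[tuple[int, float]], gain: float = 0.05) -> int:
    """First concurrency level gaining less than `gain` throughput over the last.

    sweep: [(concurrency, output tok/s), ...] in ascending concurrency.
    """
    for (_, prev), (c, cur) in zip(sweep, sweep[1:]):
        if cur < prev * (1 + gain):
            return c
    return sweep[-1][0]  # never flattened within the sweep

# Illustrative sweep: throughput flattens between c=32 and c=64
sweep = [(1, 95), (8, 700), (16, 1300), (32, 2100), (64, 2150), (128, 2160)]
print(saturation_point(sweep))  # 64
```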
**TTFT vs TPOT:** TTFT measures how long before the user sees the first token (prefill latency). TPOT measures how fast subsequent tokens arrive (decode latency). Interactive applications care most about TTFT; batch processing cares most about throughput.
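The two metrics combine into a rough end-to-end latency estimate, total ≈ TTFT + (output_tokens − 1) × TPOT. A quick sketch with illustrative numbers:

```python
def request_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate total request latency: prefill plus the remaining decode steps."""
    return ttft_ms + (output_tokens - 1) * tpot_ms

# e.g., 150 ms TTFT and 20 ms TPOT at this run's 512 output tokens
print(request_latency_ms(150, 20, 512) / 1000)  # 10.37 s total
```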
**Per-GPU efficiency:** We normalize throughput to tok/s per GPU. This allows comparing different model configurations and deployment strategies on the same hardware regardless of how many GPUs each configuration uses.
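The normalization is a plain division, but it can invert rankings between deployments of different sizes. A sketch with hypothetical numbers (not measured results):

```python
def per_gpu(tok_s: float, num_gpus: int) -> float:
    """Normalize aggregate throughput to tok/s per GPU."""
    return tok_s / num_gpus

# Hypothetical: an 8-GPU deployment at 16,000 tok/s vs a 4-GPU one at 9,000 tok/s.
# The larger deployment wins on aggregate throughput but loses per GPU.
print(per_gpu(16000, 8), per_gpu(9000, 4))  # 2000.0 2250.0
```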
All benchmark scripts are included in the `scripts/` directory:

- `serve.sh`: Model serving presets (one command per model)
- `bench.sh`: Single-model concurrency sweep
- `bench_all.sh`: Full pipeline: download, serve, benchmark, cleanup

To reproduce any result:
```bash
# 1. Start the model server
$ ./scripts/serve.sh nemotron-nano

# 2. In another terminal, run benchmarks
$ ./scripts/bench.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 http://localhost:8000
```