Benchmark Methodology

Updated on 11 March, 2026

How we measure LLM inference performance on NVIDIA HGX B200 GPUs.


Test Parameters

| Parameter | Value | Rationale |
|---|---|---|
| Input tokens | 2,048 | Represents a typical prompt with context |
| Output tokens | 512 | Standard generation length |
| Dataset | Random (synthetic) | Eliminates tokenizer variance; consistent across models |
| Prompts per run | 50 (c=1), 400 (c>1) | Enough for stable averages at each concurrency level (see note) |
| Concurrency levels | 1, 8, 16, 32, 64, 128, 256, 512, 1024 | Sweep from single-user to saturation |
| Serving framework | vLLM 0.12.0 / 0.16.0 | See note below |
| GPU utilization | 0.90 | 90% of VRAM allocated for model weights and KV cache |
**vLLM version note:** Nemotron Nano, Nemotron Super 49B, GLM-5, and DeepSeek V3.2 were benchmarked on vLLM 0.16.0. MiniMax M2.5 was benchmarked on vLLM 0.12.0 because vLLM 0.16.0 crashes on MiniMax's MoE routing (`n_group=0` is incompatible with the fused DeepSeekV3 routing kernel). Both versions use the same `vllm bench serve` interface with identical parameters.

**Prompt count note:** The 50/400 split was used for Nemotron Nano and MiniMax M2.5 via `bench_all.sh`. DeepSeek V3.2 and GLM-5 were benchmarked with 4× the concurrency level in prompts (e.g., 4 prompts at c=1, 512 at c=128). Nemotron Super 49B used ~50 prompts at all concurrency levels. Throughput metrics (tok/s) are rate-based and remain valid across prompt counts, but lower prompt counts at c=1 provide less statistical averaging.

Metrics

| Metric | Unit | What It Measures |
|---|---|---|
| Output throughput | tok/s | Total output tokens generated per second across all concurrent requests |
| Peak throughput | tok/s | Maximum instantaneous throughput observed during the run |
| TTFT | ms | Time to First Token: latency from request submission to first token received |
| TPOT | ms | Time Per Output Token: average time between consecutive output tokens |
| ITL p99 | ms | Inter-Token Latency at the 99th percentile: worst-case token-to-token delay |
| Saturation point | concurrency | The concurrency level where throughput stops increasing meaningfully |

Tools

All benchmarks use vLLM's built-in benchmarking tool:

```console
$ vllm bench serve \
  --base-url http://localhost:8000 \
  --model <model_id> \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 512 \
  --num-prompts 400 \
  --max-concurrency <N> \
  --save-result \
  --result-filename results.json
```
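Driving that command across every concurrency level can be scripted. The sketch below is a hypothetical helper (not one of this repo's scripts) that builds the argv for each level, applying the 50/400 prompt-count split from the parameters table:

```python
import subprocess

# Concurrency levels from the parameters table above.
LEVELS = [1, 8, 16, 32, 64, 128, 256, 512, 1024]

def bench_command(model_id: str, concurrency: int,
                  base_url: str = "http://localhost:8000") -> list[str]:
    """Build the `vllm bench serve` argv for one concurrency level."""
    num_prompts = 50 if concurrency == 1 else 400  # the 50/400 split
    return [
        "vllm", "bench", "serve",
        "--base-url", base_url,
        "--model", model_id,
        "--dataset-name", "random",
        "--random-input-len", "2048",
        "--random-output-len", "512",
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(concurrency),
        "--save-result",
        "--result-filename", f"results_c{concurrency}.json",
    ]

def run_sweep(model_id: str) -> None:
    """Run the full sweep, one result file per concurrency level."""
    for c in LEVELS:
        subprocess.run(bench_command(model_id, c), check=True)
```

Writing one result file per level keeps the per-concurrency JSON outputs separate for aggregation later.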

Results are saved as JSON for each concurrency level and aggregated into the tables shown in this cookbook.

How to Read the Results

Throughput vs latency tradeoff: Increasing concurrency raises throughput until the GPU saturates; beyond that point, additional concurrent requests only add latency with no throughput gain. The "saturation point" marks where this transition occurs.
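"Stops increasing meaningfully" can be made concrete with a simple rule: flag the last level before the per-level throughput gain drops below some threshold. The 5% default here is an illustrative choice, not the value used in this cookbook:

```python
def saturation_point(levels: list[int], throughputs: list[float],
                     min_gain: float = 0.05) -> int:
    """Return the concurrency level after which throughput stops
    improving by at least `min_gain` (fractional)."""
    for i in range(1, len(levels)):
        if throughputs[i] < throughputs[i - 1] * (1 + min_gain):
            return levels[i - 1]
    return levels[-1]  # never saturated within the sweep
```

For example, with throughputs of 100, 700, 1300, 1350 tok/s at concurrency 1, 8, 16, 32 (made-up numbers), the gain from 16 to 32 is under 5%, so the function reports 16 as the saturation point.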

TTFT vs TPOT: TTFT measures how long before the user sees the first token (prefill latency). TPOT measures how fast subsequent tokens arrive (decode latency). Interactive applications care most about TTFT; batch processing cares most about throughput.

Per-GPU efficiency: We normalize throughput to tok/s per GPU. This allows comparing different model configurations and deployment strategies on the same hardware regardless of how many GPUs each configuration uses.
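The normalization itself is a single division, but it is what makes differently sized deployments comparable. The numbers in the example below are made up for illustration:

```python
def per_gpu_throughput(total_tok_s: float, num_gpus: int) -> float:
    """Normalize aggregate throughput (tok/s) to a per-GPU figure."""
    return total_tok_s / num_gpus

# Hypothetical comparison: one 8-GPU tensor-parallel deployment at
# 12,000 tok/s vs. two 4-GPU replicas at 7,000 tok/s each.
single_tp8 = per_gpu_throughput(12_000, 8)       # 1500.0 tok/s per GPU
two_replicas = per_gpu_throughput(2 * 7_000, 8)  # 1750.0 tok/s per GPU
```

On these (made-up) numbers, the two-replica layout uses the same eight GPUs more efficiently despite each replica being slower in aggregate.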

Reproducibility

All benchmark scripts are included in the scripts/ directory:

  • `serve.sh`: Model serving presets (one command per model)
  • `bench.sh`: Single-model concurrency sweep
  • `bench_all.sh`: Full pipeline (download, serve, benchmark, cleanup)

To reproduce any result:

```console
# 1. Start the model server
$ ./scripts/serve.sh nemotron-nano

# 2. In another terminal, run benchmarks
$ ./scripts/bench.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 http://localhost:8000
```
