How we measure LLM inference performance on NVIDIA HGX B200 GPUs.
| Parameter | Value | Rationale |
|---|---|---|
| Input tokens | 2,048 | Represents a typical prompt with context |
| Output tokens | 512 | Standard generation length |
| Dataset | Random (synthetic) | Eliminates tokenizer variance; consistent across models |
| Prompts per run | 50 (c=1), 400 (c>1) | Enough for stable averages at each concurrency level (see note) |
| Concurrency levels | 1, 8, 16, 32, 64, 128, 256, 512, 1024 | Sweep from single-user to saturation |
| Serving framework | vLLM 0.12.0 / 0.16.0 | See note below |
| GPU utilization | 0.90 | 90% of VRAM allocated for model + KV cache |
**Note on vLLM versions:** two vLLM versions were used because of a kernel incompatibility (`n_group=0` is incompatible with the fused DeepSeekV3 routing kernel). Both versions use the same `vllm bench serve` interface with identical parameters.
**Note on prompt counts:** prompt counts are set per model in `bench_all.sh`. DeepSeek V3.2 and GLM-5 were benchmarked with 4× the concurrency in prompts per level (e.g., 4 at c=1, 512 at c=128), while Nemotron Super 49B used ~50 prompts at all concurrency levels. Throughput metrics (tok/s) are rate-based and remain valid across prompt counts; however, lower prompt counts at c=1 provide less statistical averaging.
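Taken together, the parameters above fix the token volume processed at each sweep point. A quick sanity check (a minimal sketch using the table's values; the helper name is ours, not part of the benchmark tooling):

```python
# Token volume per concurrency level, from the benchmark parameters above.
INPUT_LEN = 2048    # --random-input-len
OUTPUT_LEN = 512    # --random-output-len

def tokens_per_level(num_prompts: int) -> tuple[int, int]:
    """Return (prefill_tokens, decode_tokens) processed at one sweep point."""
    return num_prompts * INPUT_LEN, num_prompts * OUTPUT_LEN

prefill, decode = tokens_per_level(400)   # c > 1 runs; c = 1 runs use 50 prompts
print(f"prefill: {prefill:,} tok, decode: {decode:,} tok")
```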
| Metric | Unit | What It Measures |
|---|---|---|
| Output throughput | tok/s | Total output tokens generated per second across all concurrent requests |
| Peak throughput | tok/s | Maximum instantaneous throughput observed during the run |
| TTFT | ms | Time to First Token: latency from request submission to first token received |
| TPOT | ms | Time Per Output Token: average time between consecutive output tokens |
| ITL p99 | ms | Inter-Token Latency at 99th percentile: worst-case token-to-token delay |
| Saturation point | concurrency | The concurrency level where throughput stops increasing meaningfully |
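The per-request metrics above can all be derived from token arrival timestamps. A minimal sketch of the arithmetic (the timestamp capture itself is assumed; vLLM's bench tool records these internally, and this helper is ours):

```python
import statistics

def request_metrics(t_submit: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPOT, and ITL p99 (all in ms) from token arrival times (s)."""
    ttft = (token_times[0] - t_submit) * 1000
    # Inter-token gaps: the delays between consecutive output tokens
    gaps = [(b - a) * 1000 for a, b in zip(token_times, token_times[1:])]
    tpot = statistics.mean(gaps)
    # p99 of the gaps approximates the worst-case token-to-token delay
    itl_p99 = statistics.quantiles(gaps, n=100)[98] if len(gaps) >= 2 else gaps[0]
    return {"ttft_ms": ttft, "tpot_ms": tpot, "itl_p99_ms": itl_p99}

# Example: 4 tokens arriving 50 ms apart after a 200 ms prefill
m = request_metrics(0.0, [0.200, 0.250, 0.300, 0.350])
```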
All benchmarks use vLLM's built-in benchmarking tool:

```bash
$ vllm bench serve \
    --base-url http://localhost:8000 \
    --model <model_id> \
    --dataset-name random \
    --random-input-len 2048 \
    --random-output-len 512 \
    --num-prompts 400 \
    --max-concurrency <N> \
    --save-result \
    --result-filename results.json
```
Results are saved as JSON for each concurrency level and aggregated into the tables shown in this cookbook.
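The aggregation step can be sketched as a short script. Note that the JSON field names used here (`max_concurrency`, `output_throughput`, `mean_ttft_ms`, `mean_tpot_ms`) and the `results_c*.json` naming pattern are assumptions; inspect the files your vLLM version actually emits before relying on them:

```python
import glob
import json

def to_markdown(results: list[dict]) -> str:
    """Render per-concurrency result dicts as a markdown table.

    Field names are assumptions about the vLLM result JSON, not a spec.
    """
    rows = sorted((r["max_concurrency"], r["output_throughput"],
                   r["mean_ttft_ms"], r["mean_tpot_ms"]) for r in results)
    lines = ["| Concurrency | tok/s | TTFT (ms) | TPOT (ms) |",
             "|---|---|---|---|"]
    lines += [f"| {c} | {t:.0f} | {f:.1f} | {p:.1f} |" for c, t, f, p in rows]
    return "\n".join(lines)

# One result file per concurrency level (hypothetical naming scheme)
results = [json.loads(open(p).read()) for p in sorted(glob.glob("results_c*.json"))]
print(to_markdown(results))
```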
**Throughput vs latency tradeoff:** Raising concurrency increases aggregate throughput until the GPU saturates. Beyond that point, additional concurrent requests only increase latency with no throughput gain. The "saturation point" marks where this transition occurs.
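One way to operationalize the saturation point is a simple gain threshold over the sweep results. A sketch, where both the 5% cutoff and the sample numbers are illustrative, not measured values from this cookbook:

```python
def saturation_point(sweep: list[tuple[int, float]], gain: float = 0.05) -> int:
    """First concurrency level gaining less than `gain` throughput over the last.

    sweep: [(concurrency, output tok/s), ...] in ascending concurrency.
    """
    for (_, prev), (c, cur) in zip(sweep, sweep[1:]):
        if cur < prev * (1 + gain):
            return c
    return sweep[-1][0]  # never flattened within the sweep

# Illustrative sweep: throughput flattens between c=32 and c=64
sweep = [(1, 95), (8, 700), (16, 1300), (32, 2100), (64, 2150), (128, 2160)]
print(saturation_point(sweep))  # 64
```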
**TTFT vs TPOT:** TTFT measures how long before the user sees the first token (prefill latency). TPOT measures how fast subsequent tokens arrive (decode latency). Interactive applications care most about TTFT; batch processing cares most about throughput.
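The two metrics combine into a rough end-to-end latency estimate, total ≈ TTFT + (output_tokens − 1) × TPOT. A quick sketch with illustrative numbers:

```python
def request_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate total request latency: prefill plus the remaining decode steps."""
    return ttft_ms + (output_tokens - 1) * tpot_ms

# e.g., 150 ms TTFT and 20 ms TPOT at this run's 512 output tokens
print(request_latency_ms(150, 20, 512) / 1000)  # 10.37 s total
```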
**Per-GPU efficiency:** We normalize throughput to tok/s per GPU. This allows comparing different model configurations and deployment strategies on the same hardware regardless of how many GPUs each configuration uses.
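The normalization is a plain division, but it can invert rankings between deployments of different sizes. A sketch with hypothetical numbers (not measured results):

```python
def per_gpu(tok_s: float, num_gpus: int) -> float:
    """Normalize aggregate throughput to tok/s per GPU."""
    return tok_s / num_gpus

# Hypothetical: an 8-GPU deployment at 16,000 tok/s vs a 4-GPU one at 9,000 tok/s.
# The larger deployment wins on aggregate throughput but loses per GPU.
print(per_gpu(16000, 8), per_gpu(9000, 4))  # 2000.0 2250.0
```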
All benchmark scripts are included in the `scripts/` directory:

- `serve.sh`: Model serving presets (one command per model)
- `bench.sh`: Single-model concurrency sweep
- `bench_all.sh`: Full pipeline: download, serve, benchmark, cleanup

To reproduce any result:
```bash
# 1. Start the model server
$ ./scripts/serve.sh nemotron-nano

# 2. In another terminal, run benchmarks
$ ./scripts/bench.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 http://localhost:8000
```