Benchmark Results Overview

Updated on 11 March, 2026

Consolidated benchmark results for all five models on NVIDIA HGX B200 GPUs.

Test Configuration

Parameter	Value
GPUs	8x NVIDIA HGX B200 (179 GB HBM3e each)
Framework	vLLM 0.16.0 (Nemotron, Nemotron Super, GLM-5, DeepSeek) / 0.12.0 (MiniMax)
Input tokens	2,048
Output tokens	512
Dataset	Random (synthetic)
GPU memory utilization	0.90
Concurrency sweep	1, 8, 16, 32, 64, 128, 256, 512, 1024
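
The source does not say which client produced these numbers. As a rough sketch, assuming vLLM's bundled `vllm bench serve` client was used against an already running server, the sweep in the table could be generated as below. The flag names reflect recent vLLM releases as far as we know and should be checked against `vllm bench serve --help`; the model ID placeholder and `prompts_per_slot` are hypothetical and do not come from the source.

```python
import shlex

# Values taken from the Test Configuration table.
INPUT_TOKENS = 2048
OUTPUT_TOKENS = 512
CONCURRENCY_SWEEP = [1, 8, 16, 32, 64, 128, 256, 512, 1024]


def bench_command(model: str, concurrency: int, prompts_per_slot: int = 4) -> list[str]:
    """Build one benchmark invocation for a single concurrency level.

    `model` and `prompts_per_slot` are hypothetical: the source lists neither
    model IDs nor the number of prompts sent per run.
    """
    return [
        "vllm", "bench", "serve",
        "--model", model,
        "--dataset-name", "random",
        "--random-input-len", str(INPUT_TOKENS),
        "--random-output-len", str(OUTPUT_TOKENS),
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(concurrency * prompts_per_slot),
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is launched here.
    for c in CONCURRENCY_SWEEP:
        print(shlex.join(bench_command("<model-id>", c)))
```

The GPU memory utilization (0.90) and the tensor-parallel size are server-side settings, so they belong on the serving process, not on the benchmark client.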

Peak Throughput Summary

Model	Active Params	TP	GPUs	Peak tok/s	tok/s/GPU	Saturation
Nemotron Nano 30B (FP8)	3B	2	2	18,829	9,415	~512
Nemotron Nano 30B (NVFP4)	3B	1	1	15,575	15,575	~512
Nemotron Super 49B (FP8)	49B	1	1	3,816	3,816	~64*
MiniMax M2.5 229B	10B	4	4	8,838	2,210	~512
GLM-5 744B	40B	8	8	2,132	267	~128
DeepSeek V3.2 685B	37B	8	8	4,370	546	~512

* Nemotron Super 49B exhibits throughput oscillation at c=256 and c=1024 (~1,587 tok/s, roughly half of peak) due to TP=1 batch scheduling effects. Peak throughput is stable at c=64–128 and c=512.

Throughput per Active Parameter

A useful metric for comparing architectural efficiency is throughput per active parameter: how many tokens per second does each billion active parameters produce?

Model	Active Params	Peak tok/s	tok/s per Active-B
Nemotron Nano 30B (FP8)	3B	18,829	6,276
Nemotron Nano 30B (NVFP4)	3B	15,575	5,192
Nemotron Super 49B (FP8)	49B	3,816	78
MiniMax M2.5 229B	10B	8,838	884
GLM-5 744B	40B	2,132	53
DeepSeek V3.2 685B	37B	4,370	118

Nemotron Nano's high per-parameter efficiency comes from its Mamba hybrid architecture: the SSM layers keep no KV cache and process a sequence in O(n) time, which leaves more memory bandwidth for weights during decode. The NVFP4 variant achieves 5,192 tok/s per active-B on a single GPU, compared with 6,276 for FP8 spread across two GPUs, making it the most cost-efficient configuration tested.
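
The last column of the table is simply peak throughput divided by active parameters in billions. A minimal sketch that reproduces it from the figures above (the helper name is ours, not from the source):

```python
# (model, active parameters in billions, peak output tok/s), copied from the table above.
RESULTS = [
    ("Nemotron Nano 30B (FP8)", 3, 18_829),
    ("Nemotron Nano 30B (NVFP4)", 3, 15_575),
    ("Nemotron Super 49B (FP8)", 49, 3_816),
    ("MiniMax M2.5 229B", 10, 8_838),
    ("GLM-5 744B", 40, 2_132),
    ("DeepSeek V3.2 685B", 37, 4_370),
]


def tok_per_active_b(peak_tok_s: float, active_b: float) -> float:
    """Peak output tokens per second per billion active parameters."""
    return peak_tok_s / active_b


for name, active_b, peak in RESULTS:
    print(f"{name:28s} {tok_per_active_b(peak, active_b):8,.0f} tok/s per active-B")
```

Because the metric ignores GPU count, it is best read alongside the per-GPU figures: the FP8 variant scores higher per active parameter but needs two GPUs to do so.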

Latency Summary (at 32 Concurrent)

Concurrency 32 represents a realistic interactive workload: enough load to keep GPUs busy without excessive queuing.

Model	TTFT (ms)	TPOT (ms)	ITL p99 (ms)
Nemotron Nano 30B (FP8)	206	7.86	40.72
Nemotron Nano 30B (NVFP4)	280	6.77	38.31
Nemotron Super 49B (FP8)	172	11.57	12.09
MiniMax M2.5 229B	91	18.82	22.04
GLM-5 744B	1,341	33.56	31.86
DeepSeek V3.2 685B	931	21.34	22.32

Key Observations

Nemotron Nano FP8 has a low TTFT at c=32 (206 ms), third behind MiniMax M2.5 and Nemotron Super, with TPOT under 8 ms
Nemotron Nano NVFP4 achieves similar latency on a single GPU: a higher TTFT (280 ms vs 206 ms), likely because prefill runs at TP=1 instead of TP=2, and a slightly lower TPOT (6.77 ms)
MiniMax M2.5 has the lowest TTFT at c=32 (91 ms) despite being a larger model, likely because TP=4 provides more compute for prefill
GLM-5 has the highest TTFT at c=32 (1,341 ms) due to DSA's attention computation overhead during prefill, but reasonable TPOT (33.56 ms)
All models achieve ITL p99 under 41ms at c=32, suitable for real-time streaming
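
To see how TTFT and TPOT combine for a full response, a common approximation is end-to-end latency ≈ TTFT + (output tokens − 1) × TPOT. The sketch below applies it to the c=32 figures with the 512-token output length from the test configuration. It assumes the reported values are means that represent a typical request; the function name is ours.

```python
OUTPUT_TOKENS = 512  # from the Test Configuration table

# model: (TTFT ms, TPOT ms) at c=32, copied from the latency table above.
LATENCY_C32 = {
    "Nemotron Nano 30B (FP8)": (206, 7.86),
    "Nemotron Nano 30B (NVFP4)": (280, 6.77),
    "Nemotron Super 49B (FP8)": (172, 11.57),
    "MiniMax M2.5 229B": (91, 18.82),
    "GLM-5 744B": (1341, 33.56),
    "DeepSeek V3.2 685B": (931, 21.34),
}


def estimated_e2e_ms(ttft_ms: float, tpot_ms: float,
                     output_tokens: int = OUTPUT_TOKENS) -> float:
    """Approximate time to the last token: first token, then N-1 decode steps."""
    return ttft_ms + (output_tokens - 1) * tpot_ms


for name, (ttft, tpot) in LATENCY_C32.items():
    print(f"{name:28s} ~{estimated_e2e_ms(ttft, tpot) / 1000:5.1f} s per response")
```

For 512-token outputs the decode term dominates: MiniMax M2.5 has the lowest TTFT, yet its estimated full-response time is more than twice that of either Nemotron Nano variant.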

Scaling Curves

Output Throughput vs Concurrency

Throughput scaling across concurrency levels for all models

Nemotron Nano FP8 (TP=2) leads at ~18,800 tok/s peak, with NVFP4 (TP=1) close behind at ~15,600 tok/s on a single GPU
MiniMax M2.5 and both Nemotron variants plateau around c=512
DeepSeek V3.2 scales well to c=512 (4,370 tok/s): MLA's compressed KV cache enables higher concurrency
GLM-5 saturates earlier at ~c=128 (2,132 tok/s) due to its larger active parameter count (40B) and lower KV cache capacity

TPOT vs Concurrency

TPOT scaling across concurrency levels

TPOT (decode latency) increases gradually with concurrency as batch sizes grow. Key observations:

Nemotron Nano FP8/NVFP4 maintain the lowest TPOT across all concurrency levels (~3ms at c=1, ~15ms at c=512), reflecting the Mamba hybrid architecture's efficient decode phase
GLM-5 and DeepSeek V3.2 show steeper TPOT growth at high concurrency: their larger active parameter counts (37-40B) consume more memory bandwidth per decode step
Nemotron Super 49B stays remarkably flat (~12ms) up to c=128 before the oscillation pattern kicks in at c=256/1024

TTFT vs Concurrency

TTFT scaling across concurrency levels

TTFT increases with concurrency as prefill requests queue behind active decode operations. MiniMax M2.5 maintains the lowest TTFT across the board, likely because TP=4 spreads prefill compute across four GPUs while only 10B parameters are active per token.

Per-GPU Throughput Efficiency

Per-GPU throughput efficiency comparison

Per-GPU throughput normalizes for tensor parallelism, revealing how efficiently each model uses its allocated GPUs. Nemotron Nano NVFP4 achieves 15,575 tok/s on a single GPU: 1.65x the per-GPU throughput of the FP8 variant (9,415 tok/s per GPU on 2 GPUs) and about 58x that of GLM-5 (267 tok/s per GPU). This metric drives deployment decisions: fewer GPUs per model instance means more instances per node and higher aggregate throughput.
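
The arithmetic behind these ratios, and behind the "more instances per node" argument, can be sketched as follows. The node-level estimate assumes independent replicas scale linearly on an 8-GPU node, which the source does not measure; function names are ours.

```python
NODE_GPUS = 8  # GPUs per node, from the Test Configuration table

# model: (peak output tok/s, GPUs per instance), from the Peak Throughput Summary.
PEAK = {
    "Nemotron Nano 30B (FP8)": (18_829, 2),
    "Nemotron Nano 30B (NVFP4)": (15_575, 1),
    "Nemotron Super 49B (FP8)": (3_816, 1),
    "MiniMax M2.5 229B": (8_838, 4),
    "GLM-5 744B": (2_132, 8),
    "DeepSeek V3.2 685B": (4_370, 8),
}


def per_gpu(peak_tok_s: float, gpus: int) -> float:
    """Peak throughput normalized by the GPUs one instance occupies."""
    return peak_tok_s / gpus


def node_aggregate(peak_tok_s: float, gpus: int, node_gpus: int = NODE_GPUS) -> float:
    """Estimated node throughput. Assumption: replicas scale linearly, no contention."""
    instances = node_gpus // gpus
    return instances * peak_tok_s


for name, (peak, gpus) in PEAK.items():
    print(f"{name:28s} {per_gpu(peak, gpus):9,.0f} tok/s/GPU "
          f"{node_aggregate(peak, gpus):10,.0f} tok/s/node (estimated)")
```

Under that linear-scaling assumption, eight NVFP4 replicas would outproduce four FP8 replicas on the same node, even though a single FP8 instance has the higher peak.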

Goodput (SLO-Constrained Throughput)

Goodput vs raw throughput comparison

Raw throughput tells you the maximum output rate, but production deployments need to meet latency SLOs. Goodput measures how many requests per second meet all SLO targets simultaneously.
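
As a minimal sketch of that definition, goodput can be computed from per-request measurements as the number of requests that meet every SLO, divided by the benchmark duration. The `RequestMetrics` type and the sample values are illustrative, not measured data; the default thresholds mirror the SLOs used in this section.

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    """Per-request latency measurements (hypothetical container type)."""
    ttft_ms: float
    tpot_ms: float


def goodput(requests: list[RequestMetrics], duration_s: float,
            ttft_slo_ms: float = 500.0, tpot_slo_ms: float = 50.0) -> float:
    """Requests per second that satisfy all SLO targets simultaneously."""
    good = sum(
        1 for r in requests
        if r.ttft_ms < ttft_slo_ms and r.tpot_ms < tpot_slo_ms
    )
    return good / duration_s


# Illustrative values only: two of four requests meet both SLOs.
sample = [
    RequestMetrics(ttft_ms=120, tpot_ms=8),    # meets both
    RequestMetrics(ttft_ms=650, tpot_ms=9),    # TTFT too high
    RequestMetrics(ttft_ms=300, tpot_ms=55),   # TPOT too high
    RequestMetrics(ttft_ms=480, tpot_ms=12),   # meets both
]
print(goodput(sample, duration_s=2.0))
```

A request that misses either target counts as zero, which is why goodput can fall even while raw token throughput keeps rising.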

Tested on Nemotron Nano 30B FP8 (TP=2) with SLOs: TTFT < 500ms and TPOT < 50ms.

Concurrency	Goodput (req/s)	Output tok/s	Mean TTFT (ms)	Mean TPOT (ms)
1	0.54	281	363	2.85
32	6.10	3,331	688	8.10
64	13.29	8,027	349	6.79
128	6.20	11,343	727	8.72
256	3.65	15,237	1,377	11.78
512	2.08	18,535	2,614	15.70

Peak goodput is at c=64 (13.29 req/s), while peak throughput is at c=512+ (18,800+ tok/s). Beyond c=64, mean TTFT rises above the 500 ms SLO (727 ms at c=128) and goodput drops sharply: at c=512, only ~6% of requests meet both SLOs despite maximum throughput.

This demonstrates the classic throughput-vs-latency trade-off: the goodput-optimal concurrency is 8x lower than the throughput-optimal concurrency. Production deployments should target c=32–64 for interactive workloads with strict SLOs, and c=256–512 only for batch processing where latency is not critical.
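
The ~6% figure and the 8x gap can be recovered from the table. The sketch below estimates the total request rate as output tok/s divided by 512 output tokens, assuming every request generates exactly the configured output length, and then divides goodput by that rate. Function names are ours.

```python
OUTPUT_TOKENS = 512  # from the Test Configuration table

# (concurrency, goodput req/s, output tok/s), copied from the goodput table above.
ROWS = [
    (1, 0.54, 281),
    (32, 6.10, 3_331),
    (64, 13.29, 8_027),
    (128, 6.20, 11_343),
    (256, 3.65, 15_237),
    (512, 2.08, 18_535),
]


def slo_attainment(goodput_rps: float, output_tok_s: float,
                   output_tokens: int = OUTPUT_TOKENS) -> float:
    """Estimated fraction of completed requests that met all SLOs."""
    total_rps = output_tok_s / output_tokens  # assumes fixed-length outputs
    return goodput_rps / total_rps


for c, gp, tok_s in ROWS:
    print(f"c={c:<4d} goodput={gp:6.2f} req/s  attainment~{slo_attainment(gp, tok_s):5.1%}")

best_goodput_c = max(ROWS, key=lambda row: row[1])[0]
best_throughput_c = max(ROWS, key=lambda row: row[2])[0]
print(best_throughput_c // best_goodput_c)  # ratio of the two optimal concurrency levels
```

The same loop can serve as a simple selection rule: pick the highest concurrency whose attainment stays above the target for the workload.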

Goodput: FP8 (TP=2) vs NVFP4 (TP=1)

Concurrency	FP8 Goodput (req/s)	NVFP4 Goodput (req/s)	FP8 TTFT (ms)	NVFP4 TTFT (ms)
1	0.54	0.54	363	210
32	6.10	7.21	688	487
64	13.29	8.90	349	427
128	6.20	8.24	727	682
256	3.65	8.87	1,377	1,589
512	2.08	0.55	2,614	3,341

FP8 peaks at 13.29 req/s (c=64), NVFP4 peaks at 8.90 req/s (c=64). FP8 has higher peak goodput because TP=2 distributes prefill across 2 GPUs, keeping TTFT lower at moderate concurrency. However, NVFP4 maintains more consistent goodput across c=32–256 (7.21–8.90 req/s) while FP8 drops sharply past c=64. NVFP4 still wins on per-GPU efficiency: it achieves 8.90 goodput req/s on 1 GPU vs FP8's 13.29 on 2 GPUs (about 6.65 req/s per GPU), delivering better per-GPU goodput even under SLO constraints.
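
The per-GPU comparison is a straightforward normalization of the table above by the number of GPUs each configuration occupies. A short sketch, with names of our choosing:

```python
FP8_GPUS = 2    # Nemotron Nano FP8 runs at TP=2
NVFP4_GPUS = 1  # Nemotron Nano NVFP4 runs at TP=1

# concurrency: (FP8 goodput req/s, NVFP4 goodput req/s), copied from the table above.
GOODPUT = {
    1: (0.54, 0.54),
    32: (6.10, 7.21),
    64: (13.29, 8.90),
    128: (6.20, 8.24),
    256: (3.65, 8.87),
    512: (2.08, 0.55),
}


def per_gpu_goodput(concurrency: int) -> tuple[float, float]:
    """Return (FP8, NVFP4) goodput per GPU at the given concurrency."""
    fp8, nvfp4 = GOODPUT[concurrency]
    return fp8 / FP8_GPUS, nvfp4 / NVFP4_GPUS


for c in GOODPUT:
    fp8_pg, nvfp4_pg = per_gpu_goodput(c)
    leader = "NVFP4" if nvfp4_pg > fp8_pg else "FP8"
    print(f"c={c:<4d} FP8={fp8_pg:5.2f}  NVFP4={nvfp4_pg:5.2f} req/s/GPU  -> {leader}")
```

On these figures NVFP4 leads per GPU at every tested concurrency except c=512, where both configurations are far past their SLO limits.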

Startup Times (Model Already Cached)

Configuration	Weight Load	Total Startup	Notes
Nemotron Nano NVFP4 (TP=1)	3.6s	~41s	Single GPU, fastest
Nemotron Nano FP8 (TP=2)	4.6s	~70s	2 GPUs, NVLink init

Startup times measured with models already in HuggingFace cache. First-time downloads add 1–5 minutes depending on model size and network speed.

Why Mostly MoE?

Four of the five models in this cookbook use Mixture of Experts (MoE) architecture. This reflects the current state of open-source LLMs:

Few dense models above 100B have been released since Llama 3.1 405B (July 2024)
Most major open-source flagship releases since mid-2024 use MoE: DeepSeek V3, GLM-5, MiniMax M2.5, Qwen3, Llama 4
MoE enables larger total parameter counts (more knowledge capacity) while keeping per-token compute manageable

The exception is Nemotron Super 49B, a dense NAS-optimized transformer based on Llama 3.3. It was included specifically because its standard attention architecture is compatible with NVIDIA Dynamo's NIXL KV transfer for disaggregated serving: a key testing requirement the MoE models could not satisfy.

Architecture Diversity

The five models represent five distinct attention mechanisms:

Model	Architecture	Attention Type	Key Property
Nemotron Nano 30B	MoE	Mamba (SSM) + Transformer hybrid	No KV cache for SSM layers, O(n) time
Nemotron Super 49B	Dense	Standard multi-head attention	Full KV cache, compatible with NIXL
MiniMax M2.5 229B	MoE	Lightning Attention (linear + SoftMax)	O(n) intra-chunk, standard inter-chunk
GLM-5 744B	MoE	DeepSeek Sparse Attention (DSA)	Selectively attends to important tokens
DeepSeek V3.2 685B	MoE	Multi-head Latent Attention (MLA)	Compressed KV via latent projections

This diversity means each model has fundamentally different memory and compute characteristics during inference, making them interesting comparison points beyond raw throughput numbers.