Consolidated benchmark results for all five models on NVIDIA HGX B200 GPUs.

| Parameter | Value |
|---|---|
| GPUs | 8x NVIDIA HGX B200 (179 GB HBM3e each) |
| Framework | vLLM 0.16.0 (Nemotron, Nemotron Super, GLM-5, DeepSeek) / 0.12.0 (MiniMax) |
| Input tokens | 2,048 |
| Output tokens | 512 |
| Dataset | Random (synthetic) |
| GPU memory utilization | 0.90 |
| Concurrency sweep | 1, 8, 16, 32, 64, 128, 256, 512, 1024 |
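
The source does not say which benchmark client produced these numbers. As a rough sketch, assuming vLLM's bundled `vllm bench serve` client was used, the sweep in the table could be generated like this. The model ID and the `prompts_per_slot` multiplier are placeholders, not values from the original runs, and the flags should be checked against the installed vLLM version.

```python
import shlex

# Values taken from the configuration table above.
INPUT_TOKENS = 2048
OUTPUT_TOKENS = 512
CONCURRENCY_SWEEP = [1, 8, 16, 32, 64, 128, 256, 512, 1024]


def sweep_commands(model: str, prompts_per_slot: int = 4) -> list[str]:
    """Build one `vllm bench serve` command line per concurrency level.

    `prompts_per_slot` is a hypothetical knob (not from the source): it
    scales the request count with concurrency so that every level runs
    several full waves of requests.
    """
    commands = []
    for c in CONCURRENCY_SWEEP:
        argv = [
            "vllm", "bench", "serve",
            "--model", model,
            "--dataset-name", "random",
            "--random-input-len", str(INPUT_TOKENS),
            "--random-output-len", str(OUTPUT_TOKENS),
            "--max-concurrency", str(c),
            "--num-prompts", str(c * prompts_per_slot),
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder model ID.
    for line in sweep_commands("my-org/my-model"):
        print(line)
```

Each printed line is meant to run against an already-started server for the corresponding model.
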

| Model | Active Params | TP | GPUs | Peak tok/s | tok/s/GPU | Saturation |
|---|---|---|---|---|---|---|
| Nemotron Nano 30B (FP8) | 3B | 2 | 2 | 18,829 | 9,415 | ~512 |
| Nemotron Nano 30B (NVFP4) | 3B | 1 | 1 | 15,575 | 15,575 | ~512 |
| Nemotron Super 49B (FP8) | 49B | 1 | 1 | 3,816 | 3,816 | ~64* |
| MiniMax M2.5 229B | 10B | 4 | 4 | 8,838 | 2,210 | ~512 |
| GLM-5 744B | 40B | 8 | 8 | 2,132 | 267 | ~128 |
| DeepSeek V3.2 685B | 37B | 8 | 8 | 4,370 | 546 | ~512 |

\* Nemotron Super 49B exhibits throughput oscillation at c=256 and c=1024 (~1,587 tok/s, roughly 40% of peak) due to TP=1 batch scheduling effects. Peak throughput is stable at c=64–128 and c=512.

A useful metric for comparing architectural efficiency is throughput per billion active parameters: how many tokens per second each billion active parameters produces.

| Model | Active Params | Peak tok/s | tok/s per Active-B |
|---|---|---|---|
| Nemotron Nano 30B (FP8) | 3B | 18,829 | 6,276 |
| Nemotron Nano 30B (NVFP4) | 3B | 15,575 | 5,192 |
| Nemotron Super 49B (FP8) | 49B | 3,816 | 78 |
| MiniMax M2.5 229B | 10B | 8,838 | 884 |
| GLM-5 744B | 40B | 2,132 | 53 |
| DeepSeek V3.2 685B | 37B | 4,370 | 118 |

Nemotron Nano's extreme per-parameter efficiency comes from its Mamba hybrid architecture: SSM layers keep no KV cache and process tokens in O(n) time, which keeps the memory bandwidth pipeline saturated. The FP8 variant posts the highest figure (6,276 tok/s per active-B) but needs two GPUs; the NVFP4 variant achieves 5,192 tok/s per active-B on a single GPU, making it the most cost-efficient configuration tested.
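
The two normalized columns in the tables above are simple ratios of peak throughput. The sketch below recomputes them from the published rows; the `Result` class and `round_half_up` helper are illustrative names, and half-up rounding is an assumption chosen because it reproduces the published figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    active_params_b: float  # active parameters, in billions
    gpus: int               # GPUs per model instance (equal to TP here)
    peak_tok_s: float       # peak output tokens per second

    @property
    def tok_s_per_gpu(self) -> float:
        return self.peak_tok_s / self.gpus

    @property
    def tok_s_per_active_b(self) -> float:
        return self.peak_tok_s / self.active_params_b


def round_half_up(x: float) -> int:
    """Round a non-negative value half-up (9414.5 becomes 9415)."""
    return int(x + 0.5)


# Rows copied from the summary table.
RESULTS = [
    Result("Nemotron Nano 30B (FP8)", 3, 2, 18_829),
    Result("Nemotron Nano 30B (NVFP4)", 3, 1, 15_575),
    Result("Nemotron Super 49B (FP8)", 49, 1, 3_816),
    Result("MiniMax M2.5 229B", 10, 4, 8_838),
    Result("GLM-5 744B", 40, 8, 2_132),
    Result("DeepSeek V3.2 685B", 37, 8, 4_370),
]

if __name__ == "__main__":
    for r in RESULTS:
        per_gpu = round_half_up(r.tok_s_per_gpu)
        per_b = round_half_up(r.tok_s_per_active_b)
        print(f"{r.model:28s} {per_gpu:>7,} tok/s/GPU {per_b:>7,} tok/s per active-B")
```
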
Concurrency 32 represents a realistic interactive workload: enough load to keep GPUs busy without excessive queuing.

| Model | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|
| Nemotron Nano 30B (FP8) | 206 | 7.86 | 40.72 |
| Nemotron Nano 30B (NVFP4) | 280 | 6.77 | 38.31 |
| Nemotron Super 49B (FP8) | 172 | 11.57 | 12.09 |
| MiniMax M2.5 229B | 91 | 18.82 | 22.04 |
| GLM-5 744B | 1,341 | 33.56 | 31.86 |
| DeepSeek V3.2 685B | 931 | 21.34 | 22.32 |
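
To turn these figures into a per-request wait time, one common approximation is end-to-end latency = TTFT + (output tokens - 1) x TPOT. This assumes TPOT is averaged over every token after the first, which the source does not state explicitly, so treat the results as estimates rather than measured values.

```python
OUTPUT_TOKENS = 512  # output length from the benchmark configuration

# (TTFT ms, TPOT ms) at concurrency 32, copied from the table above.
LATENCY_C32 = {
    "Nemotron Nano 30B (FP8)": (206, 7.86),
    "Nemotron Nano 30B (NVFP4)": (280, 6.77),
    "Nemotron Super 49B (FP8)": (172, 11.57),
    "MiniMax M2.5 229B": (91, 18.82),
    "GLM-5 744B": (1341, 33.56),
    "DeepSeek V3.2 685B": (931, 21.34),
}


def e2e_ms(ttft_ms: float, tpot_ms: float, output_tokens: int = OUTPUT_TOKENS) -> float:
    """Estimate end-to-end latency: first token, then (n - 1) decode steps."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_ms + (output_tokens - 1) * tpot_ms


if __name__ == "__main__":
    for model, (ttft, tpot) in LATENCY_C32.items():
        print(f"{model:28s} ~{e2e_ms(ttft, tpot) / 1000:.1f} s per 512-token response")
```

The estimate makes the trade-off visible: MiniMax M2.5 has the lowest TTFT, but its higher TPOT dominates the total time for a 512-token response.
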


TPOT (decode latency) increases gradually with concurrency as batch sizes grow.

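TTFT increases with concurrency as prefill requests queue behind active decode operations. MiniMax M2.5 maintains the lowest TTFT across the board (91 ms at c=32). GPU count alone does not explain this, since GLM-5 and DeepSeek V3.2 run at TP=8 with far higher TTFT; the more likely cause is that MiniMax spreads prefill compute for only 10B active parameters across four GPUs (TP=4).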

Per-GPU throughput normalizes for tensor parallelism, revealing how efficiently each model uses its allocated GPUs. Nemotron Nano NVFP4 achieves 15,575 tok/s on a single GPU: 1.65x more per-GPU throughput than the FP8 variant on 2 GPUs, and 58x more than GLM-5 per GPU. This metric drives deployment decisions: fewer GPUs per model instance means more instances per node and higher aggregate throughput.
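
The deployment argument above can be made concrete with a back-of-the-envelope estimate: how many independent instances fit on one 8-GPU node, and what they would deliver in aggregate. This sketch assumes replicas scale linearly and do not interfere with each other, which was not measured here, so the totals are upper bounds.

```python
GPUS_PER_NODE = 8  # node size from the benchmark configuration

# (tensor-parallel size, peak tok/s per instance) from the summary table.
CONFIGS = {
    "Nemotron Nano 30B (FP8)": (2, 18_829),
    "Nemotron Nano 30B (NVFP4)": (1, 15_575),
    "Nemotron Super 49B (FP8)": (1, 3_816),
    "MiniMax M2.5 229B": (4, 8_838),
    "GLM-5 744B": (8, 2_132),
    "DeepSeek V3.2 685B": (8, 4_370),
}


def node_capacity(tp: int, peak_tok_s: float,
                  gpus_per_node: int = GPUS_PER_NODE) -> tuple[int, float]:
    """Return (instances per node, aggregate tok/s), assuming linear scaling."""
    if tp < 1 or tp > gpus_per_node:
        raise ValueError("tp must be between 1 and gpus_per_node")
    instances = gpus_per_node // tp
    return instances, instances * peak_tok_s


if __name__ == "__main__":
    for model, (tp, peak) in CONFIGS.items():
        instances, total = node_capacity(tp, peak)
        print(f"{model:28s} {instances} instance(s), up to {total:,.0f} tok/s per node")
```
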

Raw throughput tells you the maximum output rate, but production deployments need to meet latency SLOs. Goodput measures how many requests per second meet all SLO targets simultaneously.
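
As a minimal sketch of that definition, goodput can be computed from per-request latency records by counting only the requests that satisfy every SLO. The `RequestStats` type and the sample values are illustrative, not taken from the benchmark logs; the default thresholds mirror the SLOs used in the test below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestStats:
    ttft_ms: float  # time to first token
    tpot_ms: float  # mean time per output token after the first


def goodput(requests: list[RequestStats], duration_s: float,
            ttft_slo_ms: float = 500.0, tpot_slo_ms: float = 50.0) -> float:
    """Requests per second that meet all SLO targets simultaneously."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    good = sum(
        1 for r in requests
        if r.ttft_ms < ttft_slo_ms and r.tpot_ms < tpot_slo_ms
    )
    return good / duration_s


if __name__ == "__main__":
    # Synthetic sample: only the first and last requests meet both SLOs.
    sample = [
        RequestStats(ttft_ms=100, tpot_ms=10),
        RequestStats(ttft_ms=600, tpot_ms=10),  # misses the TTFT SLO
        RequestStats(ttft_ms=100, tpot_ms=60),  # misses the TPOT SLO
        RequestStats(ttft_ms=499, tpot_ms=49),
    ]
    print(goodput(sample, duration_s=2.0))
```

A request that misses either target contributes nothing, which is why goodput can fall while raw throughput keeps rising.
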
Tested on Nemotron Nano 30B FP8 (TP=2) with SLOs of TTFT < 500 ms and TPOT < 50 ms.

| Concurrency | Goodput (req/s) | Output tok/s | Mean TTFT (ms) | Mean TPOT (ms) |
|---|---|---|---|---|
| 1 | 0.54 | 281 | 363 | 2.85 |
| 32 | 6.10 | 3,331 | 688 | 8.10 |
| 64 | 13.29 | 8,027 | 349 | 6.79 |
| 128 | 6.20 | 11,343 | 727 | 8.72 |
| 256 | 3.65 | 15,237 | 1,377 | 11.78 |
| 512 | 2.08 | 18,535 | 2,614 | 15.70 |

Peak goodput is at c=64 (13.29 req/s), while peak throughput is at c=512+ (18,800+ tok/s). Beyond c=64, mean TTFT exceeds the 500 ms SLO and goodput drops sharply: at c=512, only ~6% of requests meet both SLOs despite maximum throughput.
This demonstrates the classic throughput-vs-latency trade-off: the goodput-optimal concurrency is 8x lower than the throughput-optimal concurrency. Production deployments should target c=32–64 for interactive workloads with strict SLOs, and c=256–512 only for batch processing where latency is not critical.

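
The "~6% of requests" figure can be reproduced from the FP8 goodput table. The sketch below assumes every request generates exactly 512 output tokens (the configured output length), so total request rate is output tok/s divided by 512; the fractions are therefore estimates, not logged values.

```python
OUTPUT_TOKENS = 512  # configured output length per request

# concurrency -> (goodput req/s, output tok/s), from the FP8 goodput table.
FP8_SWEEP = {
    1: (0.54, 281),
    32: (6.10, 3_331),
    64: (13.29, 8_027),
    128: (6.20, 11_343),
    256: (3.65, 15_237),
    512: (2.08, 18_535),
}


def slo_fraction(goodput_req_s: float, output_tok_s: float,
                 output_tokens: int = OUTPUT_TOKENS) -> float:
    """Estimate the share of completed requests that met all SLOs."""
    total_req_s = output_tok_s / output_tokens
    return goodput_req_s / total_req_s


if __name__ == "__main__":
    for c, (gp, tok_s) in FP8_SWEEP.items():
        print(f"c={c:<4d} ~{slo_fraction(gp, tok_s):.0%} of requests within SLO")
```
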
| Concurrency | FP8 Goodput (req/s) | NVFP4 Goodput (req/s) | FP8 TTFT (ms) | NVFP4 TTFT (ms) |
|---|---|---|---|---|
| 1 | 0.54 | 0.54 | 363 | 210 |
| 32 | 6.10 | 7.21 | 688 | 487 |
| 64 | 13.29 | 8.90 | 349 | 427 |
| 128 | 6.20 | 8.24 | 727 | 682 |
| 256 | 3.65 | 8.87 | 1,377 | 1,589 |
| 512 | 2.08 | 0.55 | 2,614 | 3,341 |

FP8 peaks at 13.29 req/s (c=64); NVFP4 peaks at 8.90 req/s (also c=64). FP8 has higher peak goodput because TP=2 distributes prefill across 2 GPUs, keeping TTFT lower at moderate concurrency. However, NVFP4 maintains more consistent goodput across c=32–256 (7.21–8.90 req/s), while FP8 drops sharply past c=64. NVFP4 still wins on per-GPU efficiency: 8.90 req/s on 1 GPU versus FP8's 13.29 req/s on 2 GPUs (about 6.65 req/s per GPU), so it delivers better per-GPU goodput even under SLO constraints.

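
The comparison above amounts to finding each variant's goodput-optimal concurrency and dividing by its GPU count. A small sketch using the table values (the dictionary layout and function name are illustrative):

```python
# Goodput (req/s) by concurrency, copied from the comparison table.
GOODPUT = {
    "FP8 (TP=2)": {
        "gpus": 2,
        "req_s": {1: 0.54, 32: 6.10, 64: 13.29, 128: 6.20, 256: 3.65, 512: 2.08},
    },
    "NVFP4 (TP=1)": {
        "gpus": 1,
        "req_s": {1: 0.54, 32: 7.21, 64: 8.90, 128: 8.24, 256: 8.87, 512: 0.55},
    },
}


def best_operating_point(req_s: dict[int, float]) -> tuple[int, float]:
    """Return (concurrency, goodput) at the goodput-optimal concurrency."""
    best_c = max(req_s, key=req_s.get)
    return best_c, req_s[best_c]


if __name__ == "__main__":
    for name, cfg in GOODPUT.items():
        c, peak = best_operating_point(cfg["req_s"])
        per_gpu = peak / cfg["gpus"]
        print(f"{name}: peak {peak:.2f} req/s at c={c}, {per_gpu:.2f} req/s per GPU")
```
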
| Configuration | Weight Load | Total Startup | Notes |
|---|---|---|---|
| Nemotron Nano NVFP4 (TP=1) | 3.6s | ~41s | Single GPU, fastest |
| Nemotron Nano FP8 (TP=2) | 4.6s | ~70s | 2 GPUs, NVLink init |

Startup times were measured with models already in the Hugging Face cache. First-time downloads add 1–5 minutes depending on model size and network speed.

Four of the five models in this cookbook use Mixture of Experts (MoE) architecture. This reflects the current state of open-source LLMs:

The exception is Nemotron Super 49B, a dense NAS-optimized transformer based on Llama 3.3. It was included specifically because its standard attention architecture is compatible with NVIDIA Dynamo's NIXL KV transfer for disaggregated serving, a key testing requirement that the MoE models could not satisfy.

The five models represent five distinct attention mechanisms:

| Model | Architecture | Attention Type | Key Property |
|---|---|---|---|
| Nemotron Nano 30B | MoE | Mamba (SSM) + Transformer hybrid | No KV cache for SSM layers, O(n) time |
| Nemotron Super 49B | Dense | Standard multi-head attention | Full KV cache, compatible with NIXL |
| MiniMax M2.5 229B | MoE | Lightning Attention (linear + SoftMax) | O(n) intra-chunk, standard inter-chunk |
| GLM-5 744B | MoE | DeepSeek Sparse Attention (DSA) | Selectively attends to important tokens |
| DeepSeek V3.2 685B | MoE | Multi-head Latent Attention (MLA) | Compressed KV via latent projections |

This diversity means each model has fundamentally different memory and compute characteristics during inference, making them interesting comparison points beyond raw throughput numbers.
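
To illustrate why the attention type matters for memory, the sketch below compares per-token KV cache growth for standard attention, a latent-compressed cache in the style of MLA, and a hybrid where only some layers use attention. All layer counts and dimensions are hypothetical round numbers chosen for illustration, not the real configurations of the models above, and the MLA formula assumes one shared latent vector plus a small positional key is cached per layer.

```python
def standard_kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                                bytes_per_elem: int = 2) -> int:
    """Standard attention: one K and one V vector per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def latent_kv_bytes_per_token(layers: int, latent_dim: int, rope_dim: int,
                              bytes_per_elem: int = 2) -> int:
    """Latent-compressed cache: one latent plus a positional key per layer."""
    return layers * (latent_dim + rope_dim) * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical dimensions, for illustration only.
    dense = standard_kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
    latent = latent_kv_bytes_per_token(layers=32, latent_dim=512, rope_dim=64)
    # Hybrid: assume only 8 of 32 layers are attention layers; SSM layers
    # hold a fixed-size state that does not grow with sequence length.
    hybrid = standard_kv_bytes_per_token(layers=8, kv_heads=8, head_dim=128)
    for name, size in [("standard", dense), ("latent", latent), ("hybrid", hybrid)]:
        print(f"{name:9s} {size / 1024:.0f} KiB of KV cache per token")
```

Smaller per-token growth leaves more memory for concurrent sequences, which is one reason saturation points and latency curves differ so much across the five models.
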