Nemotron 3 Nano 30B

Updated on 11 March, 2026

Deploy NVIDIA's Nemotron 3 Nano on NVIDIA HGX B200 GPUs. This hybrid Mamba-Transformer model delivers high throughput with only 3B active parameters per token.


Model Overview

Property Value
Model ID nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
Architecture Hybrid Mamba-Transformer MoE
Total Parameters 30B
Active Parameters 3B per token
Attention Mamba (linear/SSM) + Standard Transformer (interleaved)
Routing MoE (128 experts)
Context Length 1,048,576 tokens (1M)
Quantization FP8 (also available: NVFP4, BF16)
License NVIDIA Open Model License
Link HuggingFace

Architecture

Nemotron Nano uses a hybrid Mamba-Transformer architecture — the first production model to combine State Space Model (SSM) layers with traditional attention layers at scale. Mamba layers process sequences in linear time (O(n)) without maintaining a KV cache, while Transformer layers handle tasks that benefit from full attention. The MoE routing activates only 3B of the 30B total parameters per token.

This architecture has three implications for NVIDIA HGX B200 deployment:

  1. Minimal KV cache: Mamba layers have no KV cache, so memory consumption grows much slower with context length than pure-attention models
  2. High throughput at low TP: Only 3B active params means a single NVIDIA HGX B200 has more than enough compute
  3. NVFP4 support: The NVIDIA HGX B200's native NVFP4 format can further halve memory usage versus FP8

Quick Start

console
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

Or with Docker:

console
$ docker run --rm --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.16.0 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
Note
--trust-remote-code is required — the nemotron_h architecture uses custom model code.

Configuration

Flag Purpose
--tensor-parallel-size 2 Distribute across 2 GPUs. TP=1 also works but leaves less room for KV cache
--max-model-len 32768 Context window. Increase up to 1M for long-context workloads
--gpu-memory-utilization 0.90 Reserve 90% of VRAM for model + KV cache
--trust-remote-code Required for custom nemotron_h architecture
--kv-cache-dtype fp8 Optional: FP8 KV cache for even lower memory usage

Memory Usage (NVIDIA HGX B200 Verified)

With TP=2 on FP8:

Component Per GPU Total (2 GPUs)
Model weights ~15 GB ~30 GB
KV cache (available) ~139 GB ~278 GB
VRAM used 169 GB 338 GB
Max concurrent (32K ctx) 1,298x -

The model uses only 169 GB per GPU out of 179 GB available, leaving 139 GB per GPU for KV cache — enough for over 1,200 concurrent requests at 32K context length.

Performance (NVIDIA HGX B200 Verified)

Benchmark parameters: 2048 input tokens, 512 output tokens, random dataset. TP=2 on 2x NVIDIA HGX B200.

Concurrency Scaling

Concurrent Output tok/s Peak tok/s TTFT (ms) TPOT (ms) ITL p99 (ms)
1 277 344 362 2.91 3.14
8 1,881 2,264 152 3.96 4.28
16 3,455 3,952 160 4.32 5.19
32 3,798 4,416 206 7.86 40.72
64 8,120 10,688 324 6.74 43.82
128 11,556 17,536 520 8.91 45.19
256 15,552 26,908 1,254 11.77 48.20
512 18,746 32,935 2,610 15.46 47.88
1024 18,829 32,780 2,563 15.46 47.75

Zero failed requests across all concurrency levels.

Peak Performance

Metric Value
Peak sustained throughput 18,829 tok/s (output)
Peak burst throughput 32,935 tok/s
Saturation point ~512 concurrent requests
TTFT at 32 concurrent 206 ms
TPOT at 32 concurrent 7.86 ms

Scaling Behavior

Throughput scales linearly from concurrency 1 to 256 (~56x increase), then plateaus around 512 concurrent. The model saturates at approximately 18,800 tok/s sustained output throughput on 2 GPUs. TTFT degrades linearly with concurrency — expected behavior as prefill requests queue. TPOT remains under 8ms up to 32 concurrent, making this suitable for real-time applications at moderate load.

Note
This model uses only 2 of 8 available NVIDIA HGX B200 GPUs. In production, you could serve 4 independent instances on a single 8-GPU node, achieving ~75,000 tok/s aggregate throughput.

Test Endpoints

Chat Completion

console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Explain KV caching in transformers"}],
    "max_tokens": 256
  }'

Text Completion

console
$ curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "prompt": "The NVIDIA HGX B200 GPU features",
    "max_tokens": 128
  }'

NVFP4 Variant

NVIDIA HGX B200 supports NVIDIA's FP4 format (NVFP4), which halves memory compared to FP8. The NVFP4 variant of Nemotron Nano can run on a single GPU:

console
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

NVFP4 Performance (NVIDIA HGX B200 Verified)

Benchmark parameters: 2048 input tokens, 512 output tokens, random dataset. TP=1 on 1x NVIDIA HGX B200.

Concurrent Output tok/s Peak tok/s TTFT (ms) TPOT (ms) ITL p99 (ms)
1 257 320 387 3.13 3.26
8 1,636 1,960 298 4.32 4.49
16 2,902 3,392 203 5.12 6.22
32 4,264 5,189 280 6.77 38.31
64 6,401 8,576 507 8.44 56.80
128 9,153 13,821 692 11.26 59.11
256 12,998 20,226 1,644 14.38 64.02
512 15,496 28,259 3,378 18.51 63.90
1024 15,575 28,230 3,322 18.49 64.33

FP8 vs NVFP4 Comparison

Nemotron Nano FP8 vs NVFP4 performance comparison

Metric FP8 (TP=2, 2 GPUs) NVFP4 (TP=1, 1 GPU)
Peak sustained tok/s 18,829 15,575
tok/s per GPU 9,415 15,575
VRAM used 169 GB × 2 = 338 GB 173 GB × 1 = 173 GB

NVFP4 delivers 1.65x better cost efficiency than FP8 by eliminating the second GPU entirely. On a single 8-GPU node, you could run 8 NVFP4 instances (~124,000 tok/s aggregate) vs 4 FP8 instances (~75,000 tok/s aggregate). NVFP4 is an NVIDIA HGX B200-exclusive feature not available on prior GPU generations.

Goodput: FP8 vs NVFP4 (SLOs: TTFT<500ms, TPOT<50ms)

Concurrency FP8 Goodput (req/s) NVFP4 Goodput (req/s)
32 6.10 7.21
64 13.29 8.90
128 6.20 8.24
256 3.65 8.87

FP8 has higher peak goodput (13.29 vs 8.90 at c=64) because TP=2 distributes prefill across 2 GPUs, keeping TTFT lower. But NVFP4 uses half the GPUs (1 vs 2), delivering better per-GPU efficiency.

See FP8/NVFP4 Quantization for more details.

Startup Time (Model Cached)

Configuration Weight Load Total Startup
FP8 (TP=2) 4.6s ~70s
NVFP4 (TP=1) 3.6s ~41s

First-time model download adds ~1 minute for Nemotron Nano (~15–31 GB).

Known Issues

  • MoE config warning: vLLM logs Using default MoE config. Performance might be sub-optimal! because no tuned FP8 MoE kernel config exists for NVIDIA HGX B200 yet. Performance is still strong — a tuned config would likely improve throughput further.
  • Custom code: Always pass --trust-remote-code. Without it, vLLM cannot load the nemotron_h architecture.

Comments