First Deployment

Updated on 12 March, 2026

Deploy your first model on NVIDIA HGX B200 GPUs and verify the setup end-to-end.


Prerequisites

Complete the Environment Setup first. You should have:

  • vLLM 0.16.0 installed in a virtual environment
  • LD_LIBRARY_PATH and PATH configured
  • All 8 NVIDIA HGX B200 GPUs visible in nvidia-smi

Set Environment

Every terminal session needs these variables before running vLLM:

console
$ source .venv/bin/activate
$ export PATH="/usr/local/cuda/bin:$PATH"
$ NVIDIA_LIB_DIR=".venv/lib/python3.12/site-packages/nvidia"
$ export LD_LIBRARY_PATH="$(find "$NVIDIA_LIB_DIR" -name 'lib' -type d | tr '\n' ':')$LD_LIBRARY_PATH"
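Before starting a server, a quick sanity check surfaces a misconfigured environment early. This is only a sketch; none of these commands are required, they just report what the steps above should have produced:

```shell
# Count the bundled NVIDIA library directories now on the loader path
echo "NVIDIA lib dirs on LD_LIBRARY_PATH: $(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -c nvidia || true)"
# The CUDA toolchain should resolve from /usr/local/cuda/bin
command -v nvcc || echo "WARNING: nvcc not on PATH"
# vLLM should import from the active virtual environment
python3 -c "import vllm; print('vLLM', vllm.__version__)" || echo "WARNING: vllm not importable"
```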

Basic Deployment

Start with Nemotron Nano 30B — the smallest model in this cookbook and the fastest to download (~15 GB):

console
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

The first run downloads the weights from Hugging Face (~15 GB); subsequent starts use the cached copy.

Wait for "Application startup complete" in the logs (~70 seconds with a cached model).
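Instead of watching the logs, you can poll the health endpoint until the server answers. A minimal sketch, assuming curl is available (wait_for_server is a hypothetical helper, not part of vLLM):

```shell
# wait_for_server URL [TRIES]: poll URL/health until it returns 200,
# retrying every WAIT_INTERVAL seconds (default 5), up to TRIES attempts
wait_for_server() {
  local url=$1 tries=${2:-60} i
  for i in $(seq "$tries"); do
    if curl -sf "$url/health" > /dev/null 2>&1; then
      return 0
    fi
    sleep "${WAIT_INTERVAL:-5}"
  done
  return 1
}

# Usage:
#   wait_for_server http://localhost:8000 && echo "vLLM is ready"
```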

Test the API

Health Check

console
$ curl http://localhost:8000/health

List Models

console
$ curl http://localhost:8000/v1/models | python3 -m json.tool

Chat Completion

console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }'
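The response comes back as a JSON envelope; a small filter prints only the assistant's text. A sketch, assuming python3 is on PATH (extract_reply is a hypothetical helper name):

```shell
# extract_reply: print just the assistant message from a chat completion
# response (assumes the OpenAI-style response schema used above)
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage:
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
#          "messages": [{"role": "user", "content": "Hello"}]}' | extract_reply
```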

Text Completion

console
$ curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "prompt": "The NVIDIA HGX B200 GPU features",
    "max_tokens": 128
  }'

Streaming

console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Explain tensor parallelism in 3 sentences."}],
    "max_tokens": 256,
    "stream": true
  }'
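Streamed responses arrive as server-sent events, one `data: {...}` chunk per token group. This filter reassembles the text; a sketch, assuming GNU sed and python3 (stream_text is a hypothetical helper):

```shell
# stream_text: strip SSE framing and print the concatenated delta tokens
# (-u keeps sed unbuffered so tokens appear as they stream in)
stream_text() {
  sed -u 's/^data: //' | python3 -u -c '
import json, sys
for line in sys.stdin:
    line = line.strip()
    if not line or line == "[DONE]":
        continue
    delta = json.loads(line)["choices"][0].get("delta", {})
    sys.stdout.write(delta.get("content", ""))
print()
'
}

# Usage: append "| stream_text" to the streaming curl above (add -s and -N
# so curl stays quiet and does not buffer the stream).
```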

Multi-GPU Deployment

For larger models, increase tensor parallelism:

console
# DeepSeek V3.2: 685B parameters, requires all 8 GPUs
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --quantization fp8 \
  --block-size 1

Note
DeepSeek V3.2 is ~642 GB to download and requires ~5 minutes of DeepGEMM JIT kernel compilation on first launch.

NVFP4 Single-GPU Deployment (NVIDIA HGX B200 Exclusive)

NVFP4 halves memory compared to FP8, enabling single-GPU deployment:

console
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

Startup time: ~41 seconds (model cached). This configuration achieves 15,575 tok/s on a single GPU — see the Nemotron Nano guide for full benchmarks.

Common Flags

Flag                       Purpose                            Example / Notes
--tensor-parallel-size     Distribute across GPUs             --tensor-parallel-size 8
--max-model-len            Maximum context length             --max-model-len 32768
--gpu-memory-utilization   VRAM allocation fraction           --gpu-memory-utilization 0.90
--trust-remote-code        Required for custom architectures  Always include for cookbook models
--quantization fp8         On-the-fly FP8 quantization        For models without FP8 checkpoints
--block-size 1             Required for MLA models            DeepSeek V3.2 only
--kv-cache-dtype fp8       FP8 KV cache                       Reduces per-request memory
--enforce-eager            Disable torch.compile              For debugging startup issues

Model Quick Reference

Model                 TP   Download   Startup   Extra Flags
Nemotron Nano FP8     2    ~15 GB     ~70s      --trust-remote-code
Nemotron Nano NVFP4   1    ~19 GB     ~41s      --trust-remote-code
MiniMax M2.5          4    ~115 GB    ~3 min    --trust-remote-code (vLLM 0.12.0 only)
GLM-5                 8    ~705 GB    ~8 min    --trust-remote-code
DeepSeek V3.2         8    ~642 GB    ~8 min    --quantization fp8 --block-size 1 --trust-remote-code

Run a Quick Benchmark

Once the server is running, verify performance with a single benchmark:

console
$ vllm bench serve \
  --base-url http://localhost:8000 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 512 \
  --num-prompts 50 \
  --max-concurrency 32

Expected results for Nemotron Nano FP8 at a concurrency of 32: ~3,800 tok/s output throughput and ~206 ms TTFT.
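To see how throughput scales with load, the same benchmark can be swept across several concurrency levels. A sketch wrapping the command above (bench_sweep is a hypothetical helper):

```shell
# bench_sweep MODEL C1 [C2 ...]: run the random-prompt benchmark above
# once per requested concurrency level
bench_sweep() {
  local model=$1 c
  shift
  for c in "$@"; do
    echo "=== concurrency $c ==="
    vllm bench serve \
      --base-url http://localhost:8000 \
      --model "$model" \
      --dataset-name random \
      --random-input-len 2048 \
      --random-output-len 512 \
      --num-prompts 50 \
      --max-concurrency "$c"
  done
}

# Usage:
#   bench_sweep nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 1 8 16 32 64
```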

Stopping the Server

console
# Graceful
$ pkill -f "vllm serve"

# Force (if graceful doesn't work)
$ pkill -9 -f "vllm serve"

# Verify GPUs are freed
$ nvidia-smi --query-gpu=index,memory.used --format=csv

If GPUs still show memory in use after killing the server:

console
$ fuser /dev/nvidia0 | xargs -r kill -9
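If you script the teardown, a helper can wait until every GPU reports zero memory in use. A sketch parsing the same nvidia-smi query shown above (gpus_busy is a hypothetical helper):

```shell
# gpus_busy: succeed (exit 0) if any GPU still has memory allocated
gpus_busy() {
  nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits \
    | awk '$1 > 0 { busy = 1 } END { exit !busy }'
}

# Usage: block until the GPUs are actually free
#   while gpus_busy; do sleep 2; done; echo "GPUs freed"
```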

Next Steps
