Deploy your first model on NVIDIA HGX B200 GPUs and verify the setup end-to-end.
Complete the Environment Setup first. You should have:
- `LD_LIBRARY_PATH` and `PATH` configured
- `nvidia-smi` working

Every terminal session needs these variables before running vLLM:
$ source .venv/bin/activate
$ export PATH="/usr/local/cuda/bin:$PATH"
$ NVIDIA_LIB_DIR=".venv/lib/python3.12/site-packages/nvidia"
$ export LD_LIBRARY_PATH="$(find $NVIDIA_LIB_DIR -name 'lib' -type d | tr '\n' ':')$LD_LIBRARY_PATH"
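The `find | tr` construction above collects every `lib/` directory shipped by the pip-installed NVIDIA wheels and joins them with `:`. A minimal sketch of the same mechanic against a throwaway directory tree (the `cublas`/`cudnn` names here are placeholders, not the real wheel layout):

```shell
# Build a colon-separated library path the same way the export above does,
# using a temporary tree instead of the real site-packages directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/cublas/lib" "$tmp/cudnn/lib"
libs="$(find "$tmp" -name 'lib' -type d | tr '\n' ':')"
echo "$libs"
```

The trailing `:` from `tr` is harmless; it simply concatenates cleanly with the existing `$LD_LIBRARY_PATH`.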
Start with Nemotron Nano 30B — the smallest model in this cookbook and the fastest to download (~15 GB):
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--trust-remote-code
The first run downloads the model from HuggingFace (~15 GB). Subsequent starts use the cached weights.
Wait for `Application startup complete` in the logs (~70 seconds with a cached model).
Check that the server is up and the model is loaded:
$ curl http://localhost:8000/health
$ curl http://localhost:8000/v1/models | python3 -m json.tool
Send a chat completion:
$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"max_tokens": 100
}'
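The response is a JSON object whose reply text lives at `choices[0].message.content`. One way to pull it out with only `python3` (the JSON below is a trimmed illustrative payload, not captured server output):

```shell
# Extract the assistant message from a chat-completion response.
response='{"choices":[{"message":{"role":"assistant","content":"Hello! How can I help?"}}]}'
echo "$response" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```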
Try the plain completions endpoint:
$ curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
"prompt": "The NVIDIA HGX B200 GPU features",
"max_tokens": 128
}'
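Note that the completions endpoint returns generated text at `choices[0].text` rather than a message object (the sample payload is illustrative):

```shell
# Extract the generated text from a completions response.
response='{"choices":[{"text":" eight Blackwell GPUs."}]}'
echo "$response" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
```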
Stream the response token by token:
$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
"messages": [{"role": "user", "content": "Explain tensor parallelism in 3 sentences."}],
"max_tokens": 256,
"stream": true
}'
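With `"stream": true`, the server emits server-sent events: each chunk arrives as a `data: {json}` line with the token delta at `choices[0].delta.content`, and the stream ends with `data: [DONE]`. A sketch of reassembling the text (the sample lines are illustrative, not real server output):

```shell
# Reassemble streamed tokens from SSE lines.
printf 'data: {"choices":[{"delta":{"content":"Tensor"}}]}\ndata: {"choices":[{"delta":{"content":" parallelism"}}]}\ndata: [DONE]\n' | python3 -c '
import json, sys
for line in sys.stdin:
    payload = line.removeprefix("data: ").strip()
    if not payload or payload == "[DONE]":
        continue
    # Each chunk carries an incremental delta; concatenate them in order.
    print(json.loads(payload)["choices"][0]["delta"].get("content", ""), end="")
print()
'
```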
For larger models, increase tensor parallelism:
# DeepSeek V3.2: 685B parameters, requires all 8 GPUs
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--trust-remote-code \
--quantization fp8 \
--block-size 1
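Why all 8 GPUs: assuming FP8 stores roughly 1 byte per parameter, 685B parameters mean ~685 GB of weights alone, far more than a single B200's HBM (roughly 180 GB per GPU, treat as approximate), but a manageable share per GPU at TP=8:

```shell
# Rough per-GPU weight footprint (assumptions: FP8 = 1 byte/param,
# 685B parameters as noted above, ignoring KV cache and activations).
params_gb=685   # ~685 GB of weights at 1 byte per parameter
gpus=8
echo "per-GPU weights: ~$(( params_gb / gpus )) GB"
```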
NVFP4 halves memory compared to FP8, enabling single-GPU deployment:
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--trust-remote-code
Startup time: ~41 seconds (model cached). This configuration achieves 15,575 tok/s on a single GPU — see the Nemotron Nano guide for full benchmarks.
| Flag | Purpose | Example |
|---|---|---|
| `--tensor-parallel-size` | Distribute across GPUs | `--tensor-parallel-size 8` |
| `--max-model-len` | Maximum context length | `--max-model-len 32768` |
| `--gpu-memory-utilization` | VRAM allocation fraction | `--gpu-memory-utilization 0.90` |
| `--trust-remote-code` | Required for custom architectures | Always include for cookbook models |
| `--quantization fp8` | On-the-fly FP8 quantization | For models without FP8 checkpoints |
| `--block-size 1` | Required for MLA models | DeepSeek V3.2 only |
| `--kv-cache-dtype fp8` | FP8 KV cache | Reduces per-request memory |
| `--enforce-eager` | Disable torch.compile | For debugging startup issues |
| Model | TP | Download | Startup | Extra Flags |
|---|---|---|---|---|
| Nemotron Nano FP8 | 2 | ~15 GB | ~70s | --trust-remote-code |
| Nemotron Nano NVFP4 | 1 | ~19 GB | ~41s | --trust-remote-code |
| MiniMax M2.5 | 4 | ~115 GB | ~3 min | --trust-remote-code (vLLM 0.12.0 only) |
| GLM-5 | 8 | ~705 GB | ~8 min | --trust-remote-code |
| DeepSeek V3.2 | 8 | ~642 GB | ~8 min | --quantization fp8 --block-size 1 --trust-remote-code |
Once the server is running, verify performance with a single benchmark:
$ vllm bench serve \
--base-url http://localhost:8000 \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 512 \
--num-prompts 50 \
--max-concurrency 32
Expected output for Nemotron Nano FP8 at c=32: ~3,800 tok/s output throughput, ~206ms TTFT.
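A quick sanity check on those numbers, using only the benchmark parameters above (arithmetic, not a measurement):

```shell
# 50 prompts x 512 output tokens, at ~3,800 tok/s aggregate throughput.
num_prompts=50
output_len=512
tok_s=3800
total=$(( num_prompts * output_len ))
echo "total output tokens: $total"
echo "approx generation wall time: $(( total / tok_s )) s"
```

If the measured run takes wildly longer than this estimate, something (driver, library paths, parallelism) is misconfigured.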
When finished, stop the server:
# Graceful
$ pkill -f "vllm serve"
# Force (if graceful doesn't work)
$ pkill -9 -f "vllm serve"
# Verify GPUs are freed
$ nvidia-smi --query-gpu=index,memory.used --format=csv
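To act on that CSV programmatically, a small `awk` filter can flag GPUs still holding memory (the sample input below mimics the `--format=csv` output shape; in practice pipe the real `nvidia-smi` output in, and the 100 MiB threshold is an arbitrary choice):

```shell
# Flag any GPU holding more than 100 MiB after shutdown.
printf 'index, memory.used [MiB]\n0, 4 MiB\n1, 81000 MiB\n' |
awk -F', ' 'NR > 1 { mib = $2 + 0; if (mib > 100) print "GPU " $1 " still in use: " $2 }'
```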
If GPUs still show memory in use after killing the server:
$ fuser /dev/nvidia0 | xargs -r kill -9