First Deployment

Updated on 12 March, 2026

Deploy your first model on NVIDIA HGX B200 GPUs and verify the setup end-to-end.


Prerequisites

Complete the Environment Setup first. You should have:

  • vLLM 0.16.0 installed in a virtual environment
  • LD_LIBRARY_PATH and PATH configured
  • All 8 NVIDIA HGX B200 GPUs visible in nvidia-smi

Set Environment

Every terminal session needs these variables before running vLLM:

console
$ source .venv/bin/activate
$ export PATH="/usr/local/cuda/bin:$PATH"
$ NVIDIA_LIB_DIR=".venv/lib/python3.12/site-packages/nvidia"
$ export LD_LIBRARY_PATH="$(find "$NVIDIA_LIB_DIR" -name 'lib' -type d | tr '\n' ':')$LD_LIBRARY_PATH"
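Before starting a server, a quick sanity check surfaces a misconfigured environment early. This is only a sketch; none of these commands are required, they just report what the steps above should have produced:

```shell
# Count the bundled NVIDIA library directories now on the loader path
echo "NVIDIA lib dirs on LD_LIBRARY_PATH: $(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -c nvidia || true)"
# The CUDA toolchain should resolve from /usr/local/cuda/bin
command -v nvcc || echo "WARNING: nvcc not on PATH"
# vLLM should import from the active virtual environment
python3 -c "import vllm; print('vLLM', vllm.__version__)" || echo "WARNING: vllm not importable"
```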

Basic Deployment

Start with Nemotron Nano 30B — the smallest model in this cookbook and the fastest to download (~15 GB):

console
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

The first run downloads the weights from Hugging Face (~15 GB); subsequent starts use the cached copy.

Wait for "Application startup complete" in the logs (~70 seconds with a cached model).
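Instead of watching the logs, you can poll the health endpoint until the server answers. A minimal sketch, assuming curl is available (wait_for_server is a hypothetical helper, not part of vLLM):

```shell
# wait_for_server URL [TRIES]: poll URL/health until it returns 200,
# retrying every WAIT_INTERVAL seconds (default 5), up to TRIES attempts
wait_for_server() {
  local url=$1 tries=${2:-60} i
  for i in $(seq "$tries"); do
    if curl -sf "$url/health" > /dev/null 2>&1; then
      return 0
    fi
    sleep "${WAIT_INTERVAL:-5}"
  done
  return 1
}

# Usage:
#   wait_for_server http://localhost:8000 && echo "vLLM is ready"
```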

Test the API

Health Check

console
$ curl http://localhost:8000/health

List Models

console
$ curl http://localhost:8000/v1/models | python3 -m json.tool

Chat Completion

console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }'
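The response comes back as a JSON envelope; a small filter prints only the assistant's text. A sketch, assuming python3 is on PATH (extract_reply is a hypothetical helper name):

```shell
# extract_reply: print just the assistant message from a chat completion
# response (assumes the OpenAI-style response schema used above)
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage:
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
#          "messages": [{"role": "user", "content": "Hello"}]}' | extract_reply
```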

Text Completion

console
$ curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "prompt": "The NVIDIA HGX B200 GPU features",
    "max_tokens": 128
  }'

Streaming

console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Explain tensor parallelism in 3 sentences."}],
    "max_tokens": 256,
    "stream": true
  }'
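Streamed responses arrive as server-sent events, one `data: {...}` chunk per token group. This filter reassembles the text; a sketch, assuming GNU sed and python3 (stream_text is a hypothetical helper):

```shell
# stream_text: strip SSE framing and print the concatenated delta tokens
# (-u keeps sed unbuffered so tokens appear as they stream in)
stream_text() {
  sed -u 's/^data: //' | python3 -u -c '
import json, sys
for line in sys.stdin:
    line = line.strip()
    if not line or line == "[DONE]":
        continue
    delta = json.loads(line)["choices"][0].get("delta", {})
    sys.stdout.write(delta.get("content", ""))
print()
'
}

# Usage: append "| stream_text" to the streaming curl above (add -s and -N
# so curl stays quiet and does not buffer the stream).
```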

Multi-GPU Deployment

For larger models, increase tensor parallelism:

console
# DeepSeek V3.2: 685B parameters, requires all 8 GPUs
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --quantization fp8 \
  --block-size 1

Note
DeepSeek V3.2 is ~642 GB to download and requires ~5 minutes of DeepGEMM JIT kernel compilation on first launch.

NVFP4 Single-GPU Deployment (NVIDIA HGX B200 Exclusive)

NVFP4 halves memory compared to FP8, enabling single-GPU deployment:

console
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

Startup time: ~41 seconds (model cached). This configuration achieves 15,575 tok/s on a single GPU — see the Nemotron Nano guide for full benchmarks.

Common Flags

Flag                       Purpose                            Example / Notes
--tensor-parallel-size     Distribute across GPUs             --tensor-parallel-size 8
--max-model-len            Maximum context length             --max-model-len 32768
--gpu-memory-utilization   VRAM allocation fraction           --gpu-memory-utilization 0.90
--trust-remote-code        Required for custom architectures  Always include for cookbook models
--quantization fp8         On-the-fly FP8 quantization        For models without FP8 checkpoints
--block-size 1             Required for MLA models            DeepSeek V3.2 only
--kv-cache-dtype fp8       FP8 KV cache                       Reduces per-request memory
--enforce-eager            Disable torch.compile              For debugging startup issues

Model Quick Reference

Model                 TP   Download   Startup   Extra Flags
Nemotron Nano FP8     2    ~15 GB     ~70s      --trust-remote-code
Nemotron Nano NVFP4   1    ~19 GB     ~41s      --trust-remote-code
MiniMax M2.5          4    ~115 GB    ~3 min    --trust-remote-code (vLLM 0.12.0 only)
GLM-5                 8    ~705 GB    ~8 min    --trust-remote-code
DeepSeek V3.2         8    ~642 GB    ~8 min    --quantization fp8 --block-size 1 --trust-remote-code

Run a Quick Benchmark

Once the server is running, verify performance with a single benchmark:

console
$ vllm bench serve \
  --base-url http://localhost:8000 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 512 \
  --num-prompts 50 \
  --max-concurrency 32

Expected results for Nemotron Nano FP8 at a concurrency of 32: ~3,800 tok/s output throughput and ~206 ms TTFT.
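To see how throughput scales with load, the same benchmark can be swept across several concurrency levels. A sketch wrapping the command above (bench_sweep is a hypothetical helper):

```shell
# bench_sweep MODEL C1 [C2 ...]: run the random-prompt benchmark above
# once per requested concurrency level
bench_sweep() {
  local model=$1 c
  shift
  for c in "$@"; do
    echo "=== concurrency $c ==="
    vllm bench serve \
      --base-url http://localhost:8000 \
      --model "$model" \
      --dataset-name random \
      --random-input-len 2048 \
      --random-output-len 512 \
      --num-prompts 50 \
      --max-concurrency "$c"
  done
}

# Usage:
#   bench_sweep nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 1 8 16 32 64
```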

Stopping the Server

console
# Graceful
$ pkill -f "vllm serve"

# Force (if graceful doesn't work)
$ pkill -9 -f "vllm serve"

# Verify GPUs are freed
$ nvidia-smi --query-gpu=index,memory.used --format=csv

If GPUs still show memory in use after killing the server:

console
$ fuser /dev/nvidia0 | xargs -r kill -9
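If you script the teardown, a helper can wait until every GPU reports zero memory in use. A sketch parsing the same nvidia-smi query shown above (gpus_busy is a hypothetical helper):

```shell
# gpus_busy: succeed (exit 0) if any GPU still has memory allocated
gpus_busy() {
  nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits \
    | awk '$1 > 0 { busy = 1 } END { exit !busy }'
}

# Usage: block until the GPUs are actually free
#   while gpus_busy; do sleep 2; done; echo "GPUs freed"
```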

Next Steps
