Set up an NVIDIA HGX B200 instance for LLM inference with vLLM.
$ nvidia-smi
Expected output (abbreviated):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
|----------+------------------------+----------------------+
| GPU Name | Memory-Usage | GPU-Util |
|----------+------------------------+----------------------+
| 0 B200 | 4MiB / 183359MiB | 0% |
| 1 B200 | 4MiB / 183359MiB | 0% |
| 2 B200 | 4MiB / 183359MiB | 0% |
| 3 B200 | 4MiB / 183359MiB | 0% |
| 4 B200 | 4MiB / 183359MiB | 0% |
| 5 B200 | 4MiB / 183359MiB | 0% |
| 6 B200 | 4MiB / 183359MiB | 0% |
| 7 B200 | 4MiB / 183359MiB | 0% |
+----------+------------------------+----------------------+

Confirm all 8 GPUs are visible and persistence mode is on:
$ nvidia-smi --query-gpu=index,name,memory.total,persistence_mode --format=csv
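The CSV output is easy to check programmatically. A minimal sketch, assuming the standard `nvidia-smi` CSV layout (the sample below is illustrative, not captured output):

```python
import csv
import io

def gpus_ok(csv_output, expected=8):
    """Parse nvidia-smi CSV output; confirm GPU count and persistence mode."""
    rows = list(csv.reader(io.StringIO(csv_output.strip())))
    gpus = rows[1:]  # skip the header row
    return (len(gpus) == expected
            and all(row[3].strip() == "Enabled" for row in gpus))

# Illustrative sample (two GPUs shown for brevity)
sample = """index, name, memory.total [MiB], persistence_mode
0, NVIDIA B200, 183359 MiB, Enabled
1, NVIDIA B200, 183359 MiB, Enabled"""
print(gpus_ok(sample, expected=2))  # True
```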
The NVIDIA HGX B200 uses NVSwitch 5.0 for all-to-all GPU communication. Verify with:
$ nvidia-smi topo -m
All GPU pairs should show NV connections (NVLink via NVSwitch), not PHB or SYS.
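To sanity-check the matrix in a script, a best-effort sketch that flags any non-NVLink GPU link (the link-type names follow `nvidia-smi`'s legend; the plain-text parsing is an assumption about the layout, not a stable API):

```python
# Link types from nvidia-smi's legend that indicate traffic crossing
# PCIe or the CPU interconnect instead of NVLink/NVSwitch.
NON_NVLINK = {"SYS", "NODE", "PHB", "PXB", "PIX"}

def non_nvlink_links(topo_output):
    """Scan `nvidia-smi topo -m` text for GPU rows with non-NVLink links.
    Returns (gpu, link_type) pairs; an empty list means all links are NV*."""
    findings = []
    for line in topo_output.splitlines():
        cells = line.split()
        if not cells or not cells[0].startswith("GPU"):
            continue  # skip the legend, NIC rows, and blank lines
        for cell in cells[1:]:
            if cell in NON_NVLINK:
                findings.append((cells[0], cell))
    return findings

# Illustrative 2-GPU excerpt of a topo matrix
sample = "\tGPU0\tGPU1\nGPU0\tX\tNV18\nGPU1\tNV18\tX"
print(non_nvlink_links(sample))  # [] -> all-to-all NVLink
```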
We recommend using uv for fast dependency resolution:
# Install uv
$ curl -LsSf https://astral.sh/uv/install.sh | sh
$ export PATH="$HOME/.local/bin:$PATH"
# Create project directory and virtual environment
$ mkdir -p ~/b200-cookbook && cd ~/b200-cookbook
$ uv venv --python 3.12 .venv
$ source .venv/bin/activate
# Install dependencies
$ uv pip install vllm huggingface-hub
# Set required environment variables
$ export PATH="/usr/local/cuda/bin:$PATH"
$ NVIDIA_LIB_DIR=".venv/lib/python3.12/site-packages/nvidia"
$ export LD_LIBRARY_PATH="$(find $NVIDIA_LIB_DIR -name 'lib' -type d | tr '\n' ':')$LD_LIBRARY_PATH"
PATH addition is needed for FlashInfer's JIT compilation (requires nvcc). The LD_LIBRARY_PATH addition resolves CUDA runtime library mismatches between the PyPI wheel and the system CUDA installation. See Troubleshooting for details.
Verify the installation:
$ python3 -c "import vllm; print(f'vLLM {vllm.__version__}')"
$ python3 -c "import torch; print(f'PyTorch {torch.__version__}, CUDA {torch.version.cuda}, GPUs: {torch.cuda.device_count()}')"
Expected output:
vLLM 0.16.0
PyTorch 2.9.1+cu130, CUDA 13.0, GPUs: 8

NVIDIA HGX B200 GPUs use compute capability sm_100 (Blackwell architecture). Verify:
$ python3 -c "import torch; print(torch.cuda.get_device_capability())"
Expected: (10, 0)
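In your own scripts you can gate Blackwell-only paths on the reported capability. A minimal sketch (the `torch` call is commented out because it needs a GPU; the helper name is ours):

```python
def is_blackwell(capability):
    """sm_100 (Blackwell) reports major version 10; H100 (sm_90) reports 9."""
    major, _minor = capability
    return major >= 10

# On a B200 node: cap = torch.cuda.get_device_capability()  # -> (10, 0)
print(is_blackwell((10, 0)), is_blackwell((9, 0)))  # True False
```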
MiniMax M2.5 is the exception: its MoE configuration (n_group=0) is incompatible with the fused kernel in 0.16.0. All other models (Nemotron Nano, Nemotron Super, GLM-5, DeepSeek V3.2) use vLLM 0.16.0. See Troubleshooting: MiniMax M2.5 for details.
For disaggregated serving experiments (prefill/decode separation):
$ uv pip install "ai-dynamo[vllm]"
See the Dynamo chapter for setup and benchmarks.
Docker avoids LD_LIBRARY_PATH issues and pins the exact vLLM version. This is the recommended approach for production.
$ docker pull vllm/vllm-openai:v0.16.0
$ docker run --rm --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:v0.16.0 \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--trust-remote-code
| Flag | Purpose |
|---|---|
| `--gpus all` | Expose all 8 NVIDIA HGX B200 GPUs to the container |
| `--ipc=host` | Share host memory for tensor-parallel communication |
| `-v ~/.cache/huggingface:...` | Persist downloaded model weights across container restarts |
| `--rm` | Remove the container on exit (omit for production) |
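Once the server is up, exercise the OpenAI-compatible endpoint. A stdlib-only sketch that builds the `/v1/chat/completions` request (the model name matches the serve command above; actually sending it requires the running server, so the final call is commented out):

```python
import json
import urllib.request

def chat_request(prompt,
                 model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
                 base_url="http://localhost:8000",
                 max_tokens=256):
    """Build a POST request for vLLM's OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# resp = urllib.request.urlopen(chat_request("Say hello."))
# print(json.load(resp)["choices"][0]["message"]["content"])
```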
The equivalent Docker Compose service (save as compose.yaml):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.16.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
    command: >
      --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --trust-remote-code
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
```
$ docker compose up -d
$ docker compose logs -f vllm # watch startup
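Model download plus weight loading can take several minutes (hence the 300 s `start_period`). A polling sketch for scripts that must wait for the server; the HTTP probe is injectable so the helper can be adapted or tested without a live endpoint:

```python
import time
import urllib.error
import urllib.request

def wait_for_ready(url="http://localhost:8000/health",
                   timeout=600, interval=5, probe=None):
    """Poll the vLLM /health endpoint until it answers 200 or timeout expires."""
    if probe is None:
        def probe(u):
            try:
                return urllib.request.urlopen(u, timeout=5).status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.time() + timeout
    while time.time() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False

# wait_for_ready() blocks until the server answers, then returns True.
```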
`--ipc=host` is required for multi-GPU tensor parallelism; without it, NCCL communication will fail. On default Vultr instances, Docker requires sudo: add your user to the docker group with `sudo usermod -aG docker $USER`.
This cookbook uses the following layout:
b200-cookbook/
├── .venv/ # Python virtual environment
├── scripts/
│ ├── serve.sh # Model serving presets
│ ├── bench.sh # Single-model benchmark runner
│ └── bench_all.sh # Full benchmark pipeline
├── results/ # Benchmark output (JSON + logs)
│ ├── nemotron-nano-fp8/
│ ├── nemotron-nano-nvfp4/
│ ├── nemotron-super-49b-fp8/
│ ├── nemotron-super-49b-bf16/
│ ├── minimax-m25/
│ ├── glm5/
│ ├── deepseek-v32/
│ └── dynamo-*/
└── docs/ # This cookbook