Environment Setup

Updated on 11 March, 2026

Set up an NVIDIA HGX B200 instance for LLM inference with vLLM.


Prerequisites

  • NVIDIA HGX B200 instance (8x NVIDIA HGX B200 GPUs)
  • Ubuntu 22.04+ with NVIDIA driver 580+
  • Python 3.12

Verify GPU Hardware

console
$ nvidia-smi

Expected output (abbreviated):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08    Driver Version: 580.105.08    CUDA Version: 13.0               |
+-----------+------------------------+----------------------+
| GPU  Name | Memory-Usage           | GPU-Util             |
+-----------+------------------------+----------------------+
|  0   B200 |    4MiB / 183359MiB    |      0%              |
|  1   B200 |    4MiB / 183359MiB    |      0%              |
|  2   B200 |    4MiB / 183359MiB    |      0%              |
|  3   B200 |    4MiB / 183359MiB    |      0%              |
|  4   B200 |    4MiB / 183359MiB    |      0%              |
|  5   B200 |    4MiB / 183359MiB    |      0%              |
|  6   B200 |    4MiB / 183359MiB    |      0%              |
|  7   B200 |    4MiB / 183359MiB    |      0%              |
+-----------+------------------------+----------------------+

Confirm all 8 GPUs are visible and persistence mode is on:

console
$ nvidia-smi --query-gpu=index,name,memory.total,persistence_mode --format=csv
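If you want scripts to verify this automatically rather than eyeballing the output, the CSV format parses cleanly with the standard library. This is a sketch that assumes the default `nvidia-smi` CSV header names (`index`, `name`, `memory.total [MiB]`, `persistence_mode`); `check_gpus` is a hypothetical helper, not an NVIDIA tool:

```python
import csv
import io

def check_gpus(csv_text, expected=8):
    """Verify GPU count and persistence mode from nvidia-smi CSV output."""
    rows = list(csv.DictReader(io.StringIO(csv_text), skipinitialspace=True))
    if len(rows) != expected:
        raise RuntimeError(f"expected {expected} GPUs, found {len(rows)}")
    off = [r["index"] for r in rows if r["persistence_mode"] != "Enabled"]
    if off:
        raise RuntimeError(f"persistence mode disabled on GPUs: {off}")
    return rows

# On the instance itself:
#   import subprocess
#   out = subprocess.run(
#       ["nvidia-smi", "--query-gpu=index,name,memory.total,persistence_mode",
#        "--format=csv"], capture_output=True, text=True, check=True).stdout
#   check_gpus(out)
```

If persistence mode is reported as `Disabled`, enable it with `sudo nvidia-smi -pm 1` before serving.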

The NVIDIA HGX B200 uses NVSwitch 5.0 for all-to-all GPU communication. Verify with:

console
$ nvidia-smi topo -m

All GPU pairs should show NV connections (NVLink via NVSwitch), not PHB or SYS.
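The matrix can also be checked programmatically. The sketch below assumes the default `topo -m` layout, where each GPU data row starts with its label and the first `num_gpus` columns after the label are the GPU-to-GPU links (`X` on the diagonal, `NV<n>` for NVLink); `find_non_nvlink` is a hypothetical helper:

```python
def find_non_nvlink(topo_text, num_gpus=8):
    """Return (row_label, link_type) pairs for GPU-to-GPU links that are
    not NVLink in `nvidia-smi topo -m` output.

    Anything other than 'X' (self) or 'NV<n>' (NVLink) in a GPU column --
    e.g. PIX, PXB, PHB, NODE, SYS -- means traffic crosses PCIe or the CPU.
    """
    bad = []
    for line in topo_text.splitlines():
        fields = line.split()
        # Data rows start with a GPU label AND contain the diagonal 'X';
        # this skips the header row, NIC rows, and the legend.
        if not fields or not fields[0].startswith("GPU") or "X" not in fields:
            continue
        for cell in fields[1:num_gpus + 1]:
            if cell != "X" and not cell.startswith("NV"):
                bad.append((fields[0], cell))
    return bad
```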

Install vLLM

We recommend using uv for fast dependency resolution:

console
# Install uv
$ curl -LsSf https://astral.sh/uv/install.sh | sh
$ export PATH="$HOME/.local/bin:$PATH"

# Create project directory and virtual environment
$ mkdir -p ~/b200-cookbook && cd ~/b200-cookbook
$ uv venv --python 3.12 .venv
$ source .venv/bin/activate

# Install dependencies (pin vLLM to the version this cookbook targets)
$ uv pip install vllm==0.16.0 huggingface-hub

# Set required environment variables
$ export PATH="/usr/local/cuda/bin:$PATH"
$ NVIDIA_LIB_DIR=".venv/lib/python3.12/site-packages/nvidia"
$ export LD_LIBRARY_PATH="$(find "$NVIDIA_LIB_DIR" -name 'lib' -type d | tr '\n' ':')$LD_LIBRARY_PATH"
Note
The PATH addition is needed for FlashInfer's JIT compilation (requires nvcc). The LD_LIBRARY_PATH addition resolves CUDA runtime library mismatches between the PyPI wheel and the system CUDA installation. See Troubleshooting for details.

Verify the installation:

console
$ python3 -c "import vllm; print(f'vLLM {vllm.__version__}')"
$ python3 -c "import torch; print(f'PyTorch {torch.__version__}, CUDA {torch.version.cuda}, GPUs: {torch.cuda.device_count()}')"

Expected output:

vLLM 0.16.0
PyTorch 2.9.1+cu130, CUDA 13.0, GPUs: 8

GPU Compute Capability

NVIDIA HGX B200 GPUs use compute capability sm_100 (Blackwell architecture). Verify:

console
$ python3 -c "import torch; print(torch.cuda.get_device_capability())"

Expected: (10, 0)
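If your serving scripts should fail fast on the wrong hardware, a small guard helps. `require_blackwell` below is a hypothetical helper, not part of PyTorch or vLLM:

```python
def require_blackwell(cap):
    """Raise unless the reported compute capability is Blackwell (sm_100)
    or newer; returns the sm_XY architecture string otherwise."""
    major, minor = cap
    if (major, minor) < (10, 0):
        raise RuntimeError(
            f"compute capability {major}.{minor} is below 10.0; "
            "this cookbook targets Blackwell (sm_100)"
        )
    return f"sm_{major}{minor}"

# Usage on the instance:
#   import torch
#   print(require_blackwell(torch.cuda.get_device_capability()))
```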

Note
vLLM 0.16.0 is recommended for NVIDIA HGX B200. It adds support for GLM-5's DSA architecture, FlashInfer MLA backend for DeepSeek, and TRT-LLM ragged prefill optimizations. Requires PyTorch 2.9.1+cu130. If you encounter compatibility issues, see Troubleshooting.
Note
Version compatibility: MiniMax M2.5 requires vLLM 0.12.0 — its MoE routing (n_group=0) is incompatible with the fused kernel in 0.16.0. All other models (Nemotron Nano, Nemotron Super, GLM-5, DeepSeek V3.2) use vLLM 0.16.0. See Troubleshooting: MiniMax M2.5 for details.

Optional: NVIDIA Dynamo

For disaggregated serving experiments (prefill/decode separation):

console
$ uv pip install "ai-dynamo[vllm]"

See the Dynamo chapter for setup and benchmarks.

Docker Setup

Docker avoids LD_LIBRARY_PATH issues and pins the exact vLLM version. This is the recommended approach for production.

console
$ docker pull vllm/vllm-openai:v0.16.0

Quick Test

console
$ docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.16.0 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
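Once the container logs show the server is up, you can smoke-test the OpenAI-compatible endpoint. This sketch uses only the standard library; the `opener` parameter exists purely so the function can be exercised without a live server, and `chat_once` is a hypothetical helper:

```python
import json
import urllib.request

def chat_once(base_url, model, prompt, opener=urllib.request.urlopen):
    """POST a single chat completion to an OpenAI-compatible vLLM server
    and return the assistant message text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with opener(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Against the quick-test container above, something like `chat_once("http://localhost:8000", "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8", "Say hello")` should return a short completion.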

Docker Flags Explained

Flag                          Purpose
----                          -------
--gpus all                    Expose all 8 NVIDIA HGX B200 GPUs to the container
--ipc=host                    Share host memory for tensor-parallel communication
-v ~/.cache/huggingface:...   Persist downloaded model weights across container restarts
--rm                          Remove container on exit (omit for production)

Docker Compose (Production)

yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.16.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
    command: >
      --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --trust-remote-code
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s

console
$ docker compose up -d
$ docker compose logs -f vllm  # watch startup
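The healthcheck above gives Docker its own readiness signal, but scripts that launch benchmarks right after `docker compose up -d` also need to wait on `/health` themselves, since model loading can take minutes. A sketch (the injectable `probe`/`clock`/`sleep` parameters exist only to make the hypothetical `wait_for_health` helper testable; by default it issues real HTTP GETs):

```python
import time
import urllib.request

def wait_for_health(url="http://localhost:8000/health", timeout=600.0,
                    interval=5.0, probe=None, clock=time.monotonic,
                    sleep=time.sleep):
    """Poll a vLLM /health endpoint until it answers 200 or timeout expires.

    Returns True once healthy, False on timeout.
    """
    if probe is None:
        def probe():
            try:
                with urllib.request.urlopen(url, timeout=interval) as r:
                    return r.status == 200
            except OSError:
                return False  # connection refused / not up yet
    deadline = clock() + timeout
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```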
Note
--ipc=host is required for multi-GPU tensor parallelism. Without it, NCCL communication will fail. On default Vultr instances, Docker requires sudo — add your user to the docker group with sudo usermod -aG docker $USER.

Directory Structure

This cookbook uses the following layout:

b200-cookbook/
├── .venv/                  # Python virtual environment
├── scripts/
│   ├── serve.sh            # Model serving presets
│   ├── bench.sh            # Single-model benchmark runner
│   └── bench_all.sh        # Full benchmark pipeline
├── results/                # Benchmark output (JSON + logs)
│   ├── nemotron-nano-fp8/
│   ├── nemotron-nano-nvfp4/
│   ├── nemotron-super-49b-fp8/
│   ├── nemotron-super-49b-bf16/
│   ├── minimax-m25/
│   ├── glm5/
│   ├── deepseek-v32/
│   └── dynamo-*/
└── docs/                   # This cookbook

Next Steps
