Set up an NVIDIA HGX B200 instance for LLM inference with vLLM.
$ nvidia-smi
Expected output (abbreviated):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
|----------+------------------------+----------------------+
| GPU Name | Memory-Usage | GPU-Util |
|----------+------------------------+----------------------+
| 0 B200 | 4MiB / 183359MiB | 0% |
| 1 B200 | 4MiB / 183359MiB | 0% |
| 2 B200 | 4MiB / 183359MiB | 0% |
| 3 B200 | 4MiB / 183359MiB | 0% |
| 4 B200 | 4MiB / 183359MiB | 0% |
| 5 B200 | 4MiB / 183359MiB | 0% |
| 6 B200 | 4MiB / 183359MiB | 0% |
| 7 B200 | 4MiB / 183359MiB | 0% |
+----------+------------------------+----------------------+

Confirm all 8 GPUs are visible and persistence mode is on:
$ nvidia-smi --query-gpu=index,name,memory.total,persistence_mode --format=csv
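The CSV output is easy to check programmatically. A minimal sketch, assuming the standard `nvidia-smi` CSV layout (the sample below is illustrative, not captured output):

```python
import csv
import io

def gpus_ok(csv_output, expected=8):
    """Parse nvidia-smi CSV output; confirm GPU count and persistence mode."""
    rows = list(csv.reader(io.StringIO(csv_output.strip())))
    gpus = rows[1:]  # skip the header row
    return (len(gpus) == expected
            and all(row[3].strip() == "Enabled" for row in gpus))

# Illustrative sample (two GPUs shown for brevity)
sample = """index, name, memory.total [MiB], persistence_mode
0, NVIDIA B200, 183359 MiB, Enabled
1, NVIDIA B200, 183359 MiB, Enabled"""
print(gpus_ok(sample, expected=2))  # True
```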
The NVIDIA HGX B200 uses NVSwitch 5.0 for all-to-all GPU communication. Verify with:
$ nvidia-smi topo -m
All GPU pairs should show NV connections (NVLink via NVSwitch), not PHB or SYS.
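To sanity-check the matrix in a script, a best-effort sketch that flags any non-NVLink GPU link (the link-type names follow `nvidia-smi`'s legend; the plain-text parsing is an assumption about the layout, not a stable API):

```python
# Link types from nvidia-smi's legend that indicate traffic crossing
# PCIe or the CPU interconnect instead of NVLink/NVSwitch.
NON_NVLINK = {"SYS", "NODE", "PHB", "PXB", "PIX"}

def non_nvlink_links(topo_output):
    """Scan `nvidia-smi topo -m` text for GPU rows with non-NVLink links.
    Returns (gpu, link_type) pairs; an empty list means all links are NV*."""
    findings = []
    for line in topo_output.splitlines():
        cells = line.split()
        if not cells or not cells[0].startswith("GPU"):
            continue  # skip the legend, NIC rows, and blank lines
        for cell in cells[1:]:
            if cell in NON_NVLINK:
                findings.append((cells[0], cell))
    return findings

# Illustrative 2-GPU excerpt of a topo matrix
sample = "\tGPU0\tGPU1\nGPU0\tX\tNV18\nGPU1\tNV18\tX"
print(non_nvlink_links(sample))  # [] -> all-to-all NVLink
```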
We recommend using uv for fast dependency resolution:
# Install uv
$ curl -LsSf https://astral.sh/uv/install.sh | sh
$ export PATH="$HOME/.local/bin:$PATH"
# Create project directory and virtual environment
$ mkdir -p ~/b200-cookbook && cd ~/b200-cookbook
$ uv venv --python 3.12 .venv
$ source .venv/bin/activate
# Install dependencies
$ uv pip install vllm huggingface-hub
# Set required environment variables
$ export PATH="/usr/local/cuda/bin:$PATH"
$ NVIDIA_LIB_DIR=".venv/lib/python3.12/site-packages/nvidia"
$ export LD_LIBRARY_PATH="$(find $NVIDIA_LIB_DIR -name 'lib' -type d | tr '\n' ':')$LD_LIBRARY_PATH"
PATH addition is needed for FlashInfer's JIT compilation (requires nvcc). The LD_LIBRARY_PATH addition resolves CUDA runtime library mismatches between the PyPI wheel and the system CUDA installation. See Troubleshooting for details.
Verify the installation:
$ python3 -c "import vllm; print(f'vLLM {vllm.__version__}')"
$ python3 -c "import torch; print(f'PyTorch {torch.__version__}, CUDA {torch.version.cuda}, GPUs: {torch.cuda.device_count()}')"
Expected output:
vLLM 0.16.0
PyTorch 2.9.1+cu130, CUDA 13.0, GPUs: 8

NVIDIA HGX B200 GPUs use compute capability sm_100 (Blackwell architecture). Verify:
$ python3 -c "import torch; print(torch.cuda.get_device_capability())"
Expected: (10, 0)
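In your own scripts you can gate Blackwell-only paths on the reported capability. A minimal sketch (the `torch` call is commented out because it needs a GPU; the helper name is ours):

```python
def is_blackwell(capability):
    """sm_100 (Blackwell) reports major version 10; H100 (sm_90) reports 9."""
    major, _minor = capability
    return major >= 10

# On a B200 node: cap = torch.cuda.get_device_capability()  # -> (10, 0)
print(is_blackwell((10, 0)), is_blackwell((9, 0)))  # True False
```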
MiniMax M2.5 is the exception: its MoE configuration (n_group=0) is incompatible with the fused kernel in 0.16.0. All other models (Nemotron Nano, Nemotron Super, GLM-5, DeepSeek V3.2) use vLLM 0.16.0. See Troubleshooting: MiniMax M2.5 for details.
For disaggregated serving experiments (prefill/decode separation):
$ uv pip install "ai-dynamo[vllm]"
See the Dynamo chapter for setup and benchmarks.
Docker avoids LD_LIBRARY_PATH issues and pins the exact vLLM version. This is the recommended approach for production.
$ docker pull vllm/vllm-openai:v0.16.0
$ docker run --rm --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:v0.16.0 \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--trust-remote-code
| Flag | Purpose |
|---|---|
| `--gpus all` | Expose all 8 NVIDIA HGX B200 GPUs to the container |
| `--ipc=host` | Share host memory for tensor-parallel communication |
| `-v ~/.cache/huggingface:...` | Persist downloaded model weights across container restarts |
| `--rm` | Remove the container on exit (omit for production) |
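Once the server is up, exercise the OpenAI-compatible endpoint. A stdlib-only sketch that builds the `/v1/chat/completions` request (the model name matches the serve command above; actually sending it requires the running server, so the final call is commented out):

```python
import json
import urllib.request

def chat_request(prompt,
                 model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
                 base_url="http://localhost:8000",
                 max_tokens=256):
    """Build a POST request for vLLM's OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# resp = urllib.request.urlopen(chat_request("Say hello."))
# print(json.load(resp)["choices"][0]["message"]["content"])
```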
The equivalent Docker Compose service (save as compose.yaml):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.16.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
    command: >
      --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --trust-remote-code
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
```
$ docker compose up -d
$ docker compose logs -f vllm # watch startup
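Model download plus weight loading can take several minutes (hence the 300 s `start_period`). A polling sketch for scripts that must wait for the server; the HTTP probe is injectable so the helper can be adapted or tested without a live endpoint:

```python
import time
import urllib.error
import urllib.request

def wait_for_ready(url="http://localhost:8000/health",
                   timeout=600, interval=5, probe=None):
    """Poll the vLLM /health endpoint until it answers 200 or timeout expires."""
    if probe is None:
        def probe(u):
            try:
                return urllib.request.urlopen(u, timeout=5).status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.time() + timeout
    while time.time() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False

# wait_for_ready() blocks until the server answers, then returns True.
```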
`--ipc=host` is required for multi-GPU tensor parallelism; without it, NCCL communication will fail. On default Vultr instances, Docker requires sudo: add your user to the docker group with `sudo usermod -aG docker $USER`.
This cookbook uses the following layout:
b200-cookbook/
├── .venv/ # Python virtual environment
├── scripts/
│ ├── serve.sh # Model serving presets
│ ├── bench.sh # Single-model benchmark runner
│ └── bench_all.sh # Full benchmark pipeline
├── results/ # Benchmark output (JSON + logs)
│ ├── nemotron-nano-fp8/
│ ├── nemotron-nano-nvfp4/
│ ├── nemotron-super-49b-fp8/
│ ├── nemotron-super-49b-bf16/
│ ├── minimax-m25/
│ ├── glm5/
│ ├── deepseek-v32/
│ └── dynamo-*/
└── docs/ # This cookbook