Guidelines for deploying vLLM on NVIDIA HGX B200 instances in production.
Use all 8 GPUs for one model instance. Best for large models (GLM-5, DeepSeek V3.2) that require the full node's VRAM.
```bash
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code \
    --quantization fp8 \
    --block-size 1
```
Docker Compose:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.16.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
    command: >
      --model deepseek-ai/DeepSeek-V3-0324
      --tensor-parallel-size 8
      --max-model-len 32768
      --gpu-memory-utilization 0.95
      --trust-remote-code
      --quantization fp8
      --block-size 1
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 600s
```
For models that don't need 8 GPUs, run multiple independent instances to maximize node utilization:
```bash
# 4 instances of Nemotron Nano (TP=2 each)
$ for i in 0 1 2 3; do
    GPU_START=$((i * 2))
    GPU_END=$((GPU_START + 1))
    PORT=$((8000 + i))
    CUDA_VISIBLE_DEVICES=${GPU_START},${GPU_END} vllm serve \
      nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
      --tensor-parallel-size 2 \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.90 \
      --trust-remote-code \
      --port $PORT &
  done
```
This achieves ~75,000 tok/s aggregate output throughput.
Run different models on different GPU subsets:
```bash
# GPUs 0-3: MiniMax M2.5 (TP=4)
$ CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 4 --port 8000 --trust-remote-code &

# GPUs 4-5: Nemotron Nano (TP=2)
$ CUDA_VISIBLE_DEVICES=4,5 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
    --tensor-parallel-size 2 --port 8001 --trust-remote-code &

# GPUs 6-7: Nemotron Nano NVFP4 (TP=1 each)
$ CUDA_VISIBLE_DEVICES=6 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --tensor-parallel-size 1 --port 8002 --trust-remote-code &
$ CUDA_VISIBLE_DEVICES=7 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
    --tensor-parallel-size 1 --port 8003 --trust-remote-code &
```
When all instances serve the same model, simple round-robin works:
```nginx
upstream vllm_pool {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://vllm_pool;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
    }
}
```
When instances serve different models (as in the mixed layout above), round-robin doesn't apply; point clients at the correct port directly (e.g., http://localhost:8000 for MiniMax, http://localhost:8001 for Nemotron).
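For the mixed layout, a thin wrapper can map model IDs to ports so clients don't hard-code them. A minimal sketch; the `port_for_model` helper is illustrative, not part of vLLM, and the mapping mirrors the launch commands above:

```shell
# Illustrative model-to-port lookup for the heterogeneous layout above.
# Adjust the mapping to match your own launch commands.
port_for_model() {
  case "$1" in
    MiniMaxAI/MiniMax-M2.5)                      echo 8000 ;;
    nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8)   echo 8001 ;;
    nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4) echo 8002 ;;  # second copy is on 8003
    *) echo "unknown model: $1" >&2; return 1 ;;
  esac
}
```

Usage: `curl "http://localhost:$(port_for_model MiniMaxAI/MiniMax-M2.5)/v1/models"`.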
For smarter routing (e.g., KV-cache-aware routing), see NVIDIA Dynamo.
```bash
# Simple health check
$ curl -s http://localhost:8000/v1/models | jq .

# For load balancers
$ curl -sf http://localhost:8000/health || echo "unhealthy"

# Continuous GPU monitoring
$ watch -n 5 nvidia-smi --query-gpu=index,temperature.gpu,power.draw,memory.used,utilization.gpu \
    --format=csv
```
Key thresholds:
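The exact thresholds depend on your cooling and power envelope. As an illustration, a sketch that flags GPUs above an assumed 85 °C limit in the CSV output; the `hot_gpus` helper and the threshold are assumptions, not vendor guidance:

```shell
# Reads nvidia-smi CSV output (index, temperature.gpu, ...) on stdin and
# prints any GPU whose temperature exceeds the given limit in degrees C.
# The 85 C default is an assumed threshold, not an NVIDIA specification.
hot_gpus() {
  awk -F', ' -v max="${1:-85}" \
    'NR > 1 && $2 + 0 > max { print "GPU " $1 " at " $2 " C (limit " max ")" }'
}
```

Usage: `nvidia-smi --query-gpu=index,temperature.gpu --format=csv | hot_gpus 85`.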
vLLM exposes Prometheus-compatible metrics:
```bash
$ curl -s http://localhost:8000/metrics | head -20
```
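The text format is line-oriented, so individual gauges can be pulled out with awk. A sketch; the `metric_value` helper name is ours, while the metric names follow vLLM's `vllm:` prefix:

```shell
# Print the value of a Prometheus text-format metric read from stdin.
# Matches both bare metrics and metrics carrying a {label="..."} set.
metric_value() {
  awk -v m="$1" '$1 ~ ("^" m "($|[{])") { print $NF }'
}
```

Usage: `curl -s http://localhost:8000/metrics | metric_value vllm:num_requests_waiting`.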
Key metrics:

- `vllm:num_requests_running`: Active requests
- `vllm:num_requests_waiting`: Queued requests (should be near 0)
- `vllm:gpu_cache_usage_perc`: KV cache utilization
- `vllm:avg_generation_throughput_toks_per_s`: Output throughput

Measured peak throughput by deployment layout:

| Model | GPUs | Peak tok/s |
|---|---|---|
| Nemotron Nano NVFP4 (×8 instances) | 8 | ~124,000 |
| Nemotron Nano FP8 (×4 instances) | 8 | ~75,000 |
| Nemotron Nano FP8 (×1 instance) | 2 | 18,829 |
| MiniMax M2.5 (×2 instances) | 8 | ~17,600 |
| MiniMax M2.5 (×1 instance) | 4 | 8,838 |
| Nemotron Super 49B FP8 (×8 instances) | 8 | ~30,500 |
| Nemotron Super 49B FP8 (×1 instance) | 1 | 3,816 |
| GLM-5 744B | 8 | 2,132 |
| DeepSeek V3.2 685B | 8 | 4,370 |
Model loading times vary significantly:
| Model | Approximate Load Time |
|---|---|
| Nemotron Nano 30B (FP8, TP=2) | ~1-2 minutes |
| Nemotron Super 49B (FP8, TP=1) | ~1-2 minutes |
| MiniMax M2.5 (FP8, TP=4) | ~3-5 minutes |
| GLM-5 744B (FP8, TP=8) | ~5-10 minutes |
| DeepSeek V3.2 (FP8, TP=8) | ~5-10 minutes |
Plan for these startup times in your deployment automation. Use health check loops before routing traffic.
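Such a gate can be a simple poll loop against `/health`. A sketch; the function name, interval, and timeout defaults are our choices:

```shell
# Poll a vLLM instance's /health endpoint until it responds, or give up.
# Usage: wait_for_vllm <base_url> [timeout_s] [interval_s]
wait_for_vllm() {
  local url=$1 timeout=${2:-600} interval=${3:-5} waited=0
  until curl -sf "$url/health" > /dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s waiting for $url" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "$url healthy after ${waited}s"
}
```

Run it before adding an instance to the load-balancer pool, e.g. `wait_for_vllm http://localhost:8000 600 && nginx -s reload`.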
Models are large. Plan your disk budget:
| Model | Disk Usage (HuggingFace Cache) |
|---|---|
| Nemotron Nano 30B | ~30 GB |
| Nemotron Super 49B | ~49 GB |
| MiniMax M2.5 | ~230 GB |
| GLM-5 744B | ~705 GB |
| DeepSeek V3.2 685B | ~642 GB |
| Total (all 5) | ~1,656 GB |
Pre-download models before deploying:
```bash
$ huggingface-cli download <model_id>
```
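Before pulling ~1.7 TB of weights, it's worth confirming the cache volume has room. A sketch using GNU `df`; the `check_disk` helper and the threshold are our additions:

```shell
# Check that a directory's filesystem has at least <need_gb> GB free.
# Requires GNU coreutils df (for --output=avail).
check_disk() {
  local dir=$1 need_gb=$2 avail_gb
  avail_gb=$(df -BG --output=avail "$dir" | tail -1 | tr -dc '0-9')
  if [ "${avail_gb:-0}" -lt "$need_gb" ]; then
    echo "only ${avail_gb:-0} GB free on $dir; need ${need_gb} GB" >&2
    return 1
  fi
}
```

Usage: `check_disk ~/.cache/huggingface 1700 && huggingface-cli download <model_id>`.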
vLLM handles in-flight requests during shutdown. Send SIGTERM and wait:
```bash
$ kill -TERM $(pgrep -f "vllm serve")
# Wait for in-flight requests to complete (up to 60s default)
```
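Wrapped as a helper, with a hard kill only after the grace period expires; the SIGKILL fallback and timeout are our additions, not vLLM behavior:

```shell
# Send SIGTERM, wait up to <timeout_s> for the process to exit, then SIGKILL.
# Usage: graceful_stop <pid> [timeout_s]
graceful_stop() {
  local pid=$1 timeout=${2:-60} i
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  for ((i = 0; i < timeout; i++)); do
    kill -0 "$pid" 2>/dev/null || return 0    # exited cleanly
    sleep 1
  done
  echo "grace period expired; force-killing $pid" >&2
  kill -KILL "$pid" 2>/dev/null
}
```

Usage: `graceful_stop "$(pgrep -f 'vllm serve')" 60`.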
Redirect vLLM output to structured logs:
```bash
$ vllm serve <model> --trust-remote-code 2>&1 | tee /var/log/vllm/serving.log
```
Key log patterns to monitor:
- `Failed requests: > 0`: Investigate immediately
- `Waiting: > 0 reqs`: KV cache at capacity; consider scaling
- `GPU KV cache usage: > 90%`: Approaching memory limit
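These patterns can be turned into a crude alert by grepping the serving log. A sketch; the regexes are assumptions about vLLM's periodic stats lines and should be verified against your version's actual output:

```shell
# Scan a vLLM log for the warning patterns above; exits 0 if any are found.
# The regexes are assumptions about vLLM's log format; verify per version.
vllm_log_alerts() {
  grep -E 'Failed requests: [1-9]|Waiting: [1-9][0-9]* reqs|GPU KV cache usage: (9[1-9]|100)\.' "$1"
}
```

Usage: `vllm_log_alerts /var/log/vllm/serving.log && notify-oncall` (the alerting hook is up to you).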