Production Deployment

Updated on 11 March, 2026

Guidelines for deploying vLLM on NVIDIA HGX B200 instances in production.


Deployment Strategies

Single Large Model (TP=8)

Use all 8 GPUs for one model instance. Best for large models (GLM-5, DeepSeek V3.2) that require the full node's VRAM.

console
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --quantization fp8 \
  --block-size 1

Docker Compose:

yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.16.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
    command: >
      --model deepseek-ai/DeepSeek-V3-0324
      --tensor-parallel-size 8
      --max-model-len 32768
      --gpu-memory-utilization 0.95
      --trust-remote-code
      --quantization fp8
      --block-size 1
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 600s

Multi-Instance (Smaller Models)

For models that don't need 8 GPUs, run multiple independent instances to maximize node utilization:

console
# 4 instances of Nemotron Nano (TP=2 each)
$ for i in 0 1 2 3; do
  GPU_START=$((i * 2))
  GPU_END=$((GPU_START + 1))
  PORT=$((8000 + i))
  CUDA_VISIBLE_DEVICES=${GPU_START},${GPU_END} vllm serve \
    nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code \
    --port $PORT &
done

This achieves ~75,000 tok/s aggregate output throughput.

Mixed Model Deployment

Run different models on different GPU subsets:

console
# GPUs 0-3: MiniMax M2.5 (TP=4)
$ CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 --port 8000 --trust-remote-code &

# GPUs 4-5: Nemotron Nano (TP=2)
$ CUDA_VISIBLE_DEVICES=4,5 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --tensor-parallel-size 2 --port 8001 --trust-remote-code &

# GPUs 6-7: Nemotron Nano NVFP4 (TP=1 each)
$ CUDA_VISIBLE_DEVICES=6 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --tensor-parallel-size 1 --port 8002 --trust-remote-code &

$ CUDA_VISIBLE_DEVICES=7 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --tensor-parallel-size 1 --port 8003 --trust-remote-code &

Load Balancing

Same-Model Multi-Instance

When all instances serve the same model, simple round-robin works:

nginx
upstream vllm_pool {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    server 127.0.0.1:8003;
}

server {
    listen 80;
    location /v1/ {
        proxy_pass http://vllm_pool;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
    }
}
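To keep a crashed or restarting instance from receiving traffic, nginx's passive health checks can take a server out of rotation automatically. `max_fails` and `fail_timeout` are standard `upstream` server parameters; the values below are illustrative:

```nginx
upstream vllm_pool {
    # Take an instance out of rotation after 3 consecutive failures;
    # retry it after 30 seconds. Tune both values to your restart times.
    server 127.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8002 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8003 max_fails=3 fail_timeout=30s;
}
```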

Mixed-Model Deployment

Warning
Reverse proxy load balancing is not suitable for mixed-model deployments. A round-robin pool across ports serving different models will misroute requests and cause errors. For mixed-model setups, clients should target each model's port directly (e.g., http://localhost:8000 for MiniMax, http://localhost:8001 for Nemotron).

For smarter routing (e.g., KV-cache-aware routing), see NVIDIA Dynamo.

Health Monitoring

Health Check Endpoint

console
# Simple health check
$ curl -s http://localhost:8000/v1/models | jq .

# For load balancers
$ curl -sf http://localhost:8000/health || echo "unhealthy"

GPU Monitoring

console
# Continuous GPU monitoring
$ watch -n 5 nvidia-smi --query-gpu=index,temperature.gpu,power.draw,memory.used,utilization.gpu \
  --format=csv

Key thresholds:

  • Temperature: Alert at >85°C junction
  • Power: Normal is 200-700W per GPU under inference load
  • VRAM: Alert if used/total > 95% consistently
  • GPU Utilization: Low utilization during active serving suggests a bottleneck
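These thresholds can be checked mechanically. The sketch below parses `nvidia-smi` CSV output; note it queries `memory.total` instead of `utilization.gpu` so the VRAM ratio can be computed, and `flag_gpus` is a hypothetical helper, not an NVIDIA tool:

```bash
# Read lines from:
#   nvidia-smi --query-gpu=index,temperature.gpu,power.draw,memory.used,memory.total \
#     --format=csv,noheader,nounits
# on stdin and print one line per threshold breach.
flag_gpus() {
  awk -F', ' '
    $2 > 85        { print "GPU " $1 ": temperature " $2 "C > 85C" }
    $4 / $5 > 0.95 { print "GPU " $1 ": VRAM " $4 "/" $5 " MiB > 95%" }
  '
}

# nvidia-smi --query-gpu=index,temperature.gpu,power.draw,memory.used,memory.total \
#   --format=csv,noheader,nounits | flag_gpus
```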

vLLM Metrics

vLLM exposes Prometheus-compatible metrics:

console
$ curl -s http://localhost:8000/metrics | head -20

Key metrics:

  • vllm:num_requests_running: Active requests
  • vllm:num_requests_waiting: Queued requests (should be near 0)
  • vllm:gpu_cache_usage_perc: KV cache utilization
  • vllm:avg_generation_throughput_toks_per_s: Output throughput
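If you run Prometheus, a minimal scrape config for a four-instance node might look like the following (job name, interval, and ports are illustrative; vLLM serves metrics on the default `/metrics` path):

```yaml
scrape_configs:
  - job_name: vllm
    scrape_interval: 15s
    static_configs:
      - targets:
          - localhost:8000
          - localhost:8001
          - localhost:8002
          - localhost:8003
```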

Throughput by Model

Model                                   GPUs   Peak tok/s
Nemotron Nano NVFP4 (×8 instances)      8      ~124,000
Nemotron Nano FP8 (×4 instances)        8      ~75,000
Nemotron Nano FP8 (×1 instance)         2      18,829
MiniMax M2.5 (×2 instances)             8      ~17,600
MiniMax M2.5 (×1 instance)              4      8,838
Nemotron Super 49B FP8 (×8 instances)   8      ~30,500
Nemotron Super 49B FP8 (×1 instance)    1      3,816
GLM-5 744B                              8      2,132
DeepSeek V3.2 685B                      8      4,370

Operational Considerations

Startup Time

Model loading times vary significantly:

Model                            Approximate Load Time
Nemotron Nano 30B (FP8, TP=2)    ~1-2 minutes
Nemotron Super 49B (FP8, TP=1)   ~1-2 minutes
MiniMax M2.5 (FP8, TP=4)         ~3-5 minutes
GLM-5 744B (FP8, TP=8)           ~5-10 minutes
DeepSeek V3.2 (FP8, TP=8)        ~5-10 minutes

Plan for these startup times in your deployment automation. Use health check loops before routing traffic.
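A health-gate loop for deployment automation might look like this sketch (`wait_for_health` is a hypothetical helper; the default 60 × 10 s budget covers the slowest load times above):

```bash
# Poll a vLLM /health endpoint until it answers or the retry budget runs out.
wait_for_health() {
  local url=$1 retries=${2:-60} interval=${3:-10}
  local i
  for ((i = 0; i < retries; i++)); do
    curl -sf "$url" > /dev/null && return 0
    sleep "$interval"
  done
  return 1
}

# Gate traffic on readiness, e.g. before registering with a load balancer:
# wait_for_health http://localhost:8000/health 60 10 || exit 1
```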

Disk Space

Models are large. Plan your disk budget:

Model                Disk Usage (HuggingFace Cache)
Nemotron Nano 30B    ~30 GB
Nemotron Super 49B   ~49 GB
MiniMax M2.5         ~230 GB
GLM-5 744B           ~705 GB
DeepSeek V3.2 685B   ~642 GB
Total (all 5)        ~1,656 GB

Warning
All five models total ~1.7 TB. A typical NVIDIA HGX B200 instance has ~1.7 TB disk. You cannot have all models cached simultaneously. Use a sequential download/benchmark/delete workflow, or pre-download only the models you plan to serve.

Pre-download models before deploying:

console
$ huggingface-cli download <model_id>
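For the sequential download/benchmark/delete workflow, the Hugging Face hub cache stores each model under a `models--<org>--<name>` directory, which makes per-model cleanup scriptable. A sketch (`cache_dir` is a hypothetical helper and the benchmark step is a placeholder):

```bash
# Map a Hugging Face model id to its local hub cache directory.
# The cache uses a models--<org>--<name> layout under $HF_HOME.
cache_dir() {
  local model=$1
  echo "${HF_HOME:-$HOME/.cache/huggingface}/hub/models--${model//\//--}"
}

# Sequential download -> benchmark -> delete, one model at a time:
# for model in nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 MiniMaxAI/MiniMax-M2.5; do
#   huggingface-cli download "$model"
#   # ... serve and benchmark "$model" here ...
#   rm -rf "$(cache_dir "$model")"
# done
```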

Graceful Shutdown

vLLM handles in-flight requests during shutdown. Send SIGTERM and wait:

console
$ kill -TERM $(pgrep -f "vllm serve")
# Wait for in-flight requests to complete (up to 60s default)
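This can be wrapped in a small stop script that escalates to SIGKILL only after a grace period. A sketch (`stop_vllm` is a hypothetical helper; the 60 s default mirrors the note above):

```bash
# Graceful stop: SIGTERM, wait up to $1 seconds, then SIGKILL stragglers.
stop_vllm() {
  local grace=${1:-60} pids i
  # The [ ] keeps the pattern from matching this script's own command line.
  pids=$(pgrep -f "vllm[ ]serve") || return 0   # nothing to stop
  kill -TERM $pids 2>/dev/null
  for ((i = 0; i < grace; i++)); do
    pgrep -f "vllm[ ]serve" > /dev/null || return 0
    sleep 1
  done
  kill -KILL $pids 2>/dev/null   # grace period exhausted
}
```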

Logging

Redirect vLLM output to structured logs:

console
$ vllm serve <model> --trust-remote-code 2>&1 | tee /var/log/vllm/serving.log

Key log patterns to monitor:

  • Failed requests > 0: investigate immediately
  • Waiting > 0 reqs: KV cache at capacity; consider scaling
  • GPU KV cache usage > 90%: approaching memory limit
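These patterns can be turned into a simple log scan. A sketch; exact log line formats vary across vLLM versions, so treat the regexes as starting points and `scan_log` as a hypothetical helper:

```bash
# Scan vLLM log lines (stdin or file arguments) for the warning patterns above.
scan_log() {
  grep -E 'Failed requests: [1-9]|Waiting: [1-9][0-9]* reqs|GPU KV cache usage: (9[1-9]|100)' "$@"
}

# Continuous monitoring:
# tail -F /var/log/vllm/serving.log | scan_log
```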
