Common issues encountered when running vLLM on NVIDIA HGX B200 GPUs and their solutions.
--trust-remote-code Required

Error:

ValueError: The model's config.json does not contain any of the supported model architectures

Cause: Models with custom architectures (Nemotron Nano's nemotron_h, MiniMax M2.5's Lightning Attention, GLM-5's DSA) require custom Python code from the model repository.
Fix: Always include --trust-remote-code:
$ vllm serve <model> --trust-remote-code
All five models in this cookbook require this flag.
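If you're unsure whether a given checkpoint needs the flag, its config.json is a quick tell: repositories that ship custom modeling code typically carry an auto_map entry pointing at their Python classes. A minimal heuristic check (the sample config below is invented for illustration):

```shell
# Repos with custom architectures usually list their Python classes under
# "auto_map" in config.json; such models need --trust-remote-code.
config='{"architectures": ["NemotronHForCausalLM"], "auto_map": {"AutoModelForCausalLM": "modeling_nemotron_h.NemotronHForCausalLM"}}'
if echo "$config" | grep -q '"auto_map"'; then
  echo "--trust-remote-code likely required"
fi
```

On a downloaded model, run the grep against the repo's actual config.json instead of the inline sample.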
Output:
Using default MoE config. Performance might be sub-optimal!

Cause: vLLM has no pre-tuned FP8 MoE kernel configuration for NVIDIA HGX B200 (sm_100 / Blackwell) yet. The default config works but may not be optimal.
Impact: Performance is still strong: this is an optimization opportunity, not a bug. Tuned configs for Blackwell are expected in future vLLM releases.
Action: Safe to ignore. No workaround needed.
Error:
torch.cuda.OutOfMemoryError: CUDA out of memory

Fixes (in order of preference):

--tensor-parallel-size 4         # Instead of 2
--max-model-len 16384            # Instead of 32768
--gpu-memory-utilization 0.85    # Instead of 0.90

# Halves model weight memory vs FP8
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 --tensor-parallel-size 1
--block-size 1 Required for MLA Models

Error:

ValueError: block_size must be 1 for MLA models

Cause: Multi-Latent Attention (MLA) uses a compressed KV format that doesn't support the default block size.
Fix:
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
--tensor-parallel-size 8 \
--quantization fp8 \
--block-size 1 \
--trust-remote-code
Error:

RuntimeError: Check failed: (args->n_group != 0) is false: n_group should not be zero for DeepSeekV3 routing

Cause: vLLM 0.16.0's fused MoE kernel assumes DeepSeek V3-style grouped routing (n_group > 0). MiniMax M2.5 uses a different routing strategy where n_group = 0, triggering this assertion.
Impact: MiniMax M2.5 cannot run on vLLM 0.16.0. The --enforce-eager flag does not help: the error is in the MoE kernel, not the compilation path. vLLM 0.16.0 is V1-only with no fallback engine.
Workaround: Use vLLM 0.12.0 for MiniMax M2.5. The benchmarks in this cookbook were run on vLLM 0.12.0 for MiniMax and vLLM 0.16.0 for the other four models.
vllm bench serve Wrong Arguments

Error:

error: unrecognized arguments: --input-len

Cause: vLLM 0.12.0 uses different argument names than older versions.
Correct arguments:
$ vllm bench serve \
--random-input-len 2048 \
--random-output-len 512 \
--max-concurrency 64 \
--dataset-name random
# --random-input-len replaces --input-len
# --random-output-len replaces --output-len
# --max-concurrency replaces --concurrency
# --dataset-name random is required for random workloads
Cause: Server hasn't finished loading the model. Large models (GLM-5, DeepSeek V3.2) can take 5-10 minutes to load on 8 GPUs.
Fix: Wait for the health endpoint before benchmarking:
$ while ! curl -s http://localhost:8000/v1/models > /dev/null; do sleep 5; done
$ echo "Server ready"
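For scripted benchmarks it helps to bound the wait instead of looping forever. A sketch with a deadline (URL, poll interval, and timeout values are illustrative):

```shell
# Poll a health endpoint until it answers or the deadline passes.
# Usage: wait_for_server <url> <timeout_seconds>
wait_for_server() {
  local url="$1" deadline="$2" start
  start=$(date +%s)
  until curl -sf "$url" > /dev/null; do
    if [ $(( $(date +%s) - start )) -ge "$deadline" ]; then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 2
  done
  echo "Server ready"
}
```

For example, `wait_for_server http://localhost:8000/v1/models 600` allows for the 5-10 minute load times noted above before giving up.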
Possible causes: the KV cache filling up under high concurrency (requests queue, then time out), or individual requests exceeding the configured context length.
Fixes:
# Increase KV cache budget
--gpu-memory-utilization 0.95
# Reduce per-request context
--max-model-len 16384
# Use FP8 KV cache
--kv-cache-dtype fp8

In our benchmarks, all five models achieved 0 failed requests at all concurrency levels up to 1,024.
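As a back-of-envelope check on the FP8 KV cache option: per-token KV memory scales linearly with element width, so fp8 halves it relative to fp16. A sketch with illustrative shape numbers (not taken from any of the five models; note this naive formula doesn't apply to MLA models, which store a compressed latent instead of full K/V):

```shell
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
layers=32 kv_heads=8 head_dim=128
fp16_per_token=$(( 2 * layers * kv_heads * head_dim * 2 ))   # 2 bytes per element
fp8_per_token=$((  2 * layers * kv_heads * head_dim * 1 ))   # 1 byte per element
echo "FP16 KV: $fp16_per_token B/token, FP8 KV: $fp8_per_token B/token"
```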
# Check all 8 GPUs are visible
$ nvidia-smi --query-gpu=index,name,memory.total --format=csv
Expected output shows 8x NVIDIA HGX B200 GPUs with 183,359 MiB each.
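The visibility check can also be scripted for pre-flight validation. A sketch using canned output (the index list is a stand-in; on a live system, substitute `nvidia-smi --query-gpu=index --format=csv,noheader`):

```shell
# Count visible GPUs; expect 8 on an HGX B200 system.
gpu_list='0
1
2
3
4
5
6
7'
gpu_count=$(echo "$gpu_list" | wc -l)
if [ "$gpu_count" -eq 8 ]; then
  echo "All 8 GPUs visible"
else
  echo "Only $gpu_count GPUs visible" >&2
fi
```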
$ nvidia-smi topo -m
All GPU pairs should show NVSwitch connectivity (NV18). If any show PCIe instead, tensor parallelism performance will degrade significantly.
Ensure persistence mode is enabled to avoid GPU initialization delays:
# Check
$ nvidia-smi -q | grep "Persistence Mode"
# Enable (requires root)
$ sudo nvidia-smi -pm 1
Under sustained load, NVIDIA HGX B200 GPUs draw up to 1000W each. If the system can't sustain 8kW for all 8 GPUs, power throttling may occur:
# Check current power draw
$ nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv
# Check for throttling reasons
$ nvidia-smi -q | grep -A 5 "Clocks Throttle Reasons"
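To compare against the 8 kW budget, sum power.draw across GPUs. A sketch over canned two-GPU output (readings invented; on a live system, pipe the first nvidia-smi query above, with `,noheader` added to the format, into the awk command):

```shell
# Sum the power.draw column (rows are: index, power.draw, power.limit)
power_csv='0, 980.12 W, 1000.00 W
1, 975.40 W, 1000.00 W'
total_watts=$(echo "$power_csv" | awk -F', ' '{sum += $2} END {printf "%.2f", sum}')
echo "Total draw: $total_watts W"
```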
Some systems only have python3 available, not python:
# Check
$ which python || echo "python not found"
$ which python3
# Fix (if python3 exists but python doesn't)
$ sudo ln -s /usr/bin/python3 /usr/bin/python
If vLLM commands aren't found:
# Activate the cookbook venv
$ source /home/athosg/cooking/b200-cookbook/.venv/bin/activate
# Verify
$ which vllm
$ vllm --version
vLLM logs useful diagnostics during serving. Key lines to watch:
# Model loading confirmation
Loading model weights took X.XX GB
# KV cache allocation
GPU blocks: XXXX, CPU blocks: XXXX
# Runtime metrics (logged every 10s)
Engine 000: Avg prompt throughput: X tok/s, Avg generation throughput: X tok/s, Running: X reqs, Waiting: X reqs, GPU KV cache usage: X%, Prefix cache hit rate: X%

What to watch:

Waiting: > 0 reqs: KV cache is full; requests are queued
GPU KV cache usage: > 95%: close to capacity
Prefix cache hit rate: < 50%: prefix caching isn't effective (diverse prompts)
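For dashboards or alerting, these fields can be scraped straight from the log. A sketch against a sample line (numbers invented; the field layout is assumed to match the format above and may shift between vLLM versions):

```shell
# Extract GPU KV cache usage from a vLLM metrics log line
line='Engine 000: Avg prompt throughput: 1200.0 tok/s, Avg generation throughput: 450.0 tok/s, Running: 12 reqs, Waiting: 0 reqs, GPU KV cache usage: 42.5%, Prefix cache hit rate: 80.1%'
kv_usage=$(echo "$line" | grep -o 'GPU KV cache usage: [0-9.]*' | grep -oE '[0-9.]+$')
echo "KV cache usage: ${kv_usage}%"
```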