Troubleshooting

Updated on 11 March, 2026

Common issues encountered when running vLLM on NVIDIA HGX B200 GPUs and their solutions.


Model Loading Issues

--trust-remote-code Required

Error:

ValueError: The model's config.json does not contain any of the supported model architectures

Cause: Models with custom architectures (Nemotron Nano's nemotron_h, MiniMax M2.5's Lightning Attention, GLM-5's DSA) require custom Python code from the model repository.

Fix: Always include --trust-remote-code:

console
$ vllm serve <model> --trust-remote-code

All five models in this cookbook require this flag.

MoE Config Warning

Output:

Using default MoE config. Performance might be sub-optimal!

Cause: vLLM has no pre-tuned FP8 MoE kernel configuration for NVIDIA HGX B200 (sm_100 / Blackwell) yet. The default config works but may not be optimal.

Impact: Performance is still strong: this is an optimization opportunity, not a bug. Tuned configs for Blackwell are expected in future vLLM releases.

Action: Safe to ignore. No workaround needed.

Out of Memory During Model Loading

Error:

torch.cuda.OutOfMemoryError: CUDA out of memory

Fixes (in order of preference):

  1. Increase tensor parallelism:
--tensor-parallel-size 4  # Instead of 2
  2. Reduce context length:
--max-model-len 16384  # Instead of 32768
  3. Lower GPU memory utilization:
--gpu-memory-utilization 0.85  # Instead of 0.90
  4. Use NVFP4 quantization (NVIDIA HGX B200 only):
console
# Halves model weight memory vs FP8
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 --tensor-parallel-size 1
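
A quick way to sanity-check the first and last fixes before launching is to estimate per-GPU weight memory. A rough sketch (weights only; KV cache, activations, and CUDA graphs come on top, and the parameter counts are inputs you supply):

```shell
# Rough per-GPU weight footprint: params (billions) x bytes/param / TP size.
# Weights only -- runtime memory for KV cache and activations is extra.
weight_gb() {
  awk -v p="$1" -v b="$2" -v tp="$3" 'BEGIN { printf "%.1f\n", p * b / tp }'
}

weight_gb 30 1 2    # 30B params, FP8 (1 byte/param), TP=2 -> 15.0 GB/GPU
weight_gb 30 0.5 1  # same model, NVFP4 (~0.5 byte/param), TP=1 -> 15.0 GB/GPU
```

The second call shows why NVFP4 lets a model that needed TP=2 in FP8 fit on a single GPU.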

DeepSeek V3.2 Requires --block-size 1

Error:

ValueError: block_size must be 1 for MLA models

Cause: Multi-Latent Attention uses a compressed KV format that doesn't support the default block size.

Fix:

console
$ vllm serve deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --block-size 1 \
  --trust-remote-code

MiniMax M2.5 Crashes on vLLM 0.16.0

Error:

RuntimeError: Check failed: (args->n_group != 0) is false: n_group should not be zero for DeepSeekV3 routing

Cause: vLLM 0.16.0's fused MoE kernel assumes DeepSeek V3-style grouped routing (n_group > 0). MiniMax M2.5 uses a different routing strategy where n_group = 0, triggering this assertion.

Impact: MiniMax M2.5 cannot run on vLLM 0.16.0. The --enforce-eager flag does not help: the error is in the MoE kernel, not the compilation path. vLLM 0.16.0 is V1-only with no fallback engine.

Workaround: Use vLLM 0.12.0 for MiniMax M2.5. The benchmarks in this cookbook were run on vLLM 0.12.0 for MiniMax and vLLM 0.16.0 for the other four models.
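
A launch script can guard against this crash by checking the installed vLLM version first. A minimal sketch (the version strings match the releases discussed above; adjust if you pin differently):

```shell
# Gate MiniMax M2.5 startup on the vLLM version (0.16.x hits the
# n_group assertion in the fused MoE kernel; 0.12.x is known-good).
minimax_vllm_check() {
  case "$1" in
    0.12.*) echo "ok" ;;
    0.16.*) echo "incompatible: pin vllm==0.12.0 for MiniMax M2.5" ;;
    *)      echo "untested" ;;
  esac
}

minimax_vllm_check "$(python3 -c 'import vllm; print(vllm.__version__)' 2>/dev/null)"
```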

Benchmark Issues

vllm bench serve Wrong Arguments

Error:

error: unrecognized arguments: --input-len

Cause: vLLM 0.12.0 uses different argument names than older versions.

Correct arguments:

console
# vLLM 0.12.0 renamed these flags:
#   --random-input-len   (was --input-len)
#   --random-output-len  (was --output-len)
#   --max-concurrency    (was --concurrency)
# --dataset-name random is required for random workloads
$ vllm bench serve \
  --random-input-len 2048 \
  --random-output-len 512 \
  --max-concurrency 64 \
  --dataset-name random

Connection Refused During Benchmark

Cause: Server hasn't finished loading the model. Large models (GLM-5, DeepSeek V3.2) can take 5-10 minutes to load on 8 GPUs.

Fix: Wait for the health endpoint before benchmarking:

console
$ while ! curl -s http://localhost:8000/v1/models > /dev/null; do sleep 5; done
$ echo "Server ready"
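
The loop above waits forever if the server crashed during load. A variant with a deadline (the 600 s default is an assumption sized for the 5-10 minute load times above):

```shell
# Wait for the vLLM health endpoint, but give up after a deadline
# so a crashed server fails the script instead of hanging it.
wait_for_server() {
  url="$1"
  deadline=$(( $(date +%s) + ${2:-600} ))   # timeout in seconds, default 600
  until curl -sf "$url" > /dev/null; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $url"
      return 1
    fi
    sleep 5
  done
  echo "Server ready"
}

# Usage: wait_for_server http://localhost:8000/v1/models 600
```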

Failed Requests at High Concurrency

Possible causes:

  • KV cache exhaustion: too many concurrent requests
  • Request timeout

Fixes:

# Increase KV cache budget
--gpu-memory-utilization 0.95

# Reduce per-request context
--max-model-len 16384

# Use FP8 KV cache
--kv-cache-dtype fp8

In our benchmarks, all five models achieved 0 failed requests at all concurrency levels up to 1,024.
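
Whether a given concurrency level fits is simple arithmetic: the startup log reports the KV cache size in blocks, and each block holds a fixed number of tokens (16 by default). A back-of-envelope check, with the block count taken from your own log:

```shell
# Does the KV cache hold C concurrent requests of ~T tokens each?
# Block count comes from the vLLM startup log; block size 16 is the default.
kv_cache_fits() {
  awk -v blocks="$1" -v bs="$2" -v conc="$3" -v toks="$4" 'BEGIN {
    cap = blocks * bs; need = conc * toks
    printf "capacity=%d tokens, need=%d tokens: %s\n", cap, need, (need <= cap ? "fits" : "requests will queue")
  }'
}

kv_cache_fits 40000 16 64 2560   # 64 requests x (2048 input + 512 output) tokens
```

If "need" exceeds "capacity", expect the Waiting count in the server logs to climb and, eventually, request failures.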

GPU Issues

Verify GPU Visibility

console
# Check all 8 GPUs are visible
$ nvidia-smi --query-gpu=index,name,memory.total --format=csv

Expected output shows 8x NVIDIA B200 GPUs with 183,359 MiB each.

console
# Check NVLink topology between GPUs
$ nvidia-smi topo -m

All GPU pairs should show NVSwitch connectivity (NV18). If any show PCIe instead, tensor parallelism performance will degrade significantly.

Persistence Mode

Ensure persistence mode is enabled to avoid GPU initialization delays:

console
# Check
$ nvidia-smi -q | grep "Persistence Mode"

# Enable (requires root)
$ sudo nvidia-smi -pm 1

GPU Power Throttling

Under sustained load, NVIDIA HGX B200 GPUs draw up to 1000W each. If the system can't sustain 8kW for all 8 GPUs, power throttling may occur:

console
# Check current power draw
$ nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv

# Check for throttling reasons
$ nvidia-smi -q | grep -A 5 "Clocks Throttle Reasons"

Environment Issues

Missing python Command

Some systems only have python3 available, not python:

console
# Check
$ which python || echo "python not found"
$ which python3

# Fix (if python3 exists but python doesn't)
$ sudo ln -s /usr/bin/python3 /usr/bin/python

Virtual Environment

If vLLM commands aren't found:

console
# Activate the cookbook venv
$ source /home/athosg/cooking/b200-cookbook/.venv/bin/activate

# Verify
$ which vllm
$ vllm --version

vLLM Server Logs

vLLM logs useful diagnostics during serving. Key lines to watch:

# Model loading confirmation
Loading model weights took X.XX GB

# KV cache allocation
GPU blocks: XXXX, CPU blocks: XXXX

# Runtime metrics (logged every 10s)
Engine 000: Avg prompt throughput: X tok/s, Avg generation throughput: X tok/s,
  Running: X reqs, Waiting: X reqs, GPU KV cache usage: X%, Prefix cache hit rate: X%

Warning: signs of trouble:

  • Waiting: > 0 reqs: KV cache is full, requests are queued
  • GPU KV cache usage: > 95%: close to capacity
  • Prefix cache hit rate: < 50%: prefix caching isn't effective (diverse prompts)
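
These checks can be automated by scanning the metrics lines. A small sketch that flags the KV cache threshold (it assumes the log format shown above, which may change between vLLM versions):

```shell
# Flag metrics lines where GPU KV cache usage exceeds 95%.
kv_usage_alert() {
  echo "$1" | awk -F'GPU KV cache usage: ' \
    'NF > 1 { split($2, a, "%"); print (a[1] + 0 > 95 ? "ALERT: KV cache near capacity" : "ok") }'
}

line='Engine 000: Running: 12 reqs, Waiting: 0 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 41%'
kv_usage_alert "$line"   # -> ALERT: KV cache near capacity
```

Lines without the metric produce no output, so the function can be piped over a whole log file.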
