vLLM on the NVIDIA HGX B200 uses specialized CUDA kernel backends for attention and GEMM operations. Understanding these backends helps diagnose startup issues, explain warmup times, and tune performance.
| Backend | Purpose | Used By | JIT Compiled |
|---|---|---|---|
| FlashInfer | Attention (MHA, GQA, MLA) | All models | Yes (requires nvcc) |
| DeepGEMM | FP8 GEMM kernels | GLM-5, DeepSeek V3.2 | Yes (~5 min first launch) |
| Fused MoE | Expert routing + GEMM | All MoE models | No (pre-compiled) |
| Triton | Custom GPU kernels | Various ops | Yes |
FlashInfer is vLLM's primary attention backend on the NVIDIA HGX B200. It JIT-compiles CUDA kernels for the specific GPU architecture (sm_100 / Blackwell).
FlashInfer requires nvcc on PATH for JIT compilation:
$ export PATH="/usr/local/cuda/bin:$PATH"
Without nvcc on PATH, vLLM falls back to a slower attention implementation. Verify the GPU meets FlashInfer's minimum compute capability (8.0 or newer):
$ python3 -c "from vllm.platforms import current_platform; print('Compute capability >= 8.0:', current_platform.has_device_capability(80))"
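To check both JIT prerequisites at once, a quick sketch (it assumes only that the flashinfer package is importable when installed and that nvcc must be on PATH, as described above):

```python
import importlib.util
import shutil

def flashinfer_jit_ready() -> tuple[bool, bool]:
    """Return (flashinfer importable, nvcc on PATH) -- both are needed for JIT attention."""
    has_pkg = importlib.util.find_spec("flashinfer") is not None
    has_nvcc = shutil.which("nvcc") is not None
    return has_pkg, has_nvcc
```

If either value is False, fix the PATH export above or reinstall the flashinfer package before starting the server.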
Compiled artifacts are cached under ~/.cache/vllm/torch_compile_cache/. FlashInfer is used automatically when available. Key behaviors:
| Attention Type | FlashInfer Support | Notes |
|---|---|---|
| Standard MHA/GQA | Full | Default backend |
| MLA (DeepSeek) | Full | Via vllm::unified_mla_attention op |
| Mamba/SSM (Nemotron) | N/A | SSM layers bypass attention entirely |
| Lightning Attention (MiniMax) | Partial | Custom ops for linear attention component |
| DSA (GLM-5) | Full | Sparse attention pattern |
nvcc not found:
WARNING: FlashInfer not available, falling back to eager attention
Fix:
$ export PATH="/usr/local/cuda/bin:$PATH"
Compilation cache stale after vLLM upgrade:
$ rm -rf ~/.cache/vllm/torch_compile_cache/
The cache rebuilds automatically on next launch.
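Clearing the cache can also be scripted, for example in a post-upgrade hook (a sketch; the path is the torch.compile cache directory shown above):

```python
import shutil
from pathlib import Path

def clear_vllm_compile_cache(cache_dir: str = "~/.cache/vllm/torch_compile_cache") -> bool:
    """Delete the torch.compile cache so vLLM rebuilds it on the next launch.

    Returns True if a cache directory existed and was removed."""
    path = Path(cache_dir).expanduser()
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False
```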
DeepGEMM provides JIT-compiled FP8 GEMM kernels using CUTLASS templates. It's used automatically for FP8 models that benefit from specialized matrix multiply operations.
| Model | DeepGEMM Kernels | First-Launch Warmup |
|---|---|---|
| Nemotron Nano 30B | No | Minimal (~10s) |
| MiniMax M2.5 | No | Minimal (~10s) |
| GLM-5 744B | Yes (~2,259 kernels) | ~5 minutes |
| DeepSeek V3.2 685B | Yes (~1,827 kernels) | ~5 minutes |
On the first launch of a DeepGEMM-using model, you'll see repeated compilation messages:
Compiling DeepGEMM kernel 1/2259...
Compiling DeepGEMM kernel 2/2259...
...
This takes ~5 minutes for GLM-5 and DeepSeek V3.2. Compiled kernels are cached and reused on subsequent launches.
DeepGEMM kernels are cached per-model:
$ ls ~/.cache/deep_gemm/
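To gauge how much is already compiled, you can inspect the cache directory from Python (a sketch; the internal layout of the cache is an assumption, so this just counts files and bytes):

```python
from pathlib import Path

def deep_gemm_cache_stats(cache_dir: str = "~/.cache/deep_gemm") -> tuple[int, int]:
    """Return (file count, total bytes) under the DeepGEMM kernel cache directory."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return 0, 0
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)
```

A near-empty cache before launching GLM-5 or DeepSeek V3.2 means you should expect the ~5 minute warmup.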
DeepGEMM requires CUTLASS headers for compilation. If installed from source, you may need to symlink them:
$ ln -sf /path/to/deep_gemm/third-party/cutlass/include/cutlass /path/to/deep_gemm/deep_gemm/include/
$ ln -sf /path/to/deep_gemm/third-party/cutlass/include/cute /path/to/deep_gemm/deep_gemm/include/
The pip-installed version handles this automatically.
If DeepGEMM causes issues:
$ export VLLM_USE_DEEP_GEMM=0
This falls back to standard FP8 GEMM kernels. Performance may decrease for models that benefit from DeepGEMM's optimized paths.
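In effect, the variable behaves as an on/off flag; a simplified sketch of the toggle (not vLLM's actual parsing logic):

```python
import os

def deep_gemm_enabled(env=None) -> bool:
    """DeepGEMM is on by default (VLLM_USE_DEEP_GEMM=1) and disabled by setting it to 0."""
    env = os.environ if env is None else env
    return env.get("VLLM_USE_DEEP_GEMM", "1") != "0"
```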
vLLM uses fused Mixture-of-Experts kernels that combine expert routing and GEMM into a single operation.
No pre-tuned FP8 MoE kernel configuration exists for the NVIDIA HGX B200 (sm_100) yet. vLLM logs this warning:
Using default MoE config. Performance might be sub-optimal!
This is safe to ignore: performance is still strong with the default config. Tuned configs for Blackwell are expected in future vLLM releases.
vLLM 0.16.0's fused MoE kernel assumes DeepSeek V3-style grouped routing (n_group > 0). Models with n_group = 0 (including MiniMax M2.5) crash with:
RuntimeError: Check failed: (args->n_group != 0) is false: n_group should not be zero for DeepSeekV3 routing
Use vLLM 0.12.0 for affected models. See Troubleshooting.
vLLM 0.16.0 uses torch.compile with CUDA graphs for optimized execution:
vLLM pre-captures CUDA graphs for common batch sizes (1, 2, 4, 8, 16, ... up to 512), and requests are padded to the nearest capture size. The --max-num-seqs flag limits the maximum capture size.
For debugging, or if torch.compile causes issues:
$ vllm serve <model> --enforce-eager
This disables both torch.compile and CUDA graphs. Expect 10-30% lower throughput but faster startup.
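The capture-size padding described above can be sketched as follows (the capture list here assumes powers of two up to 512, as in the text; vLLM derives its actual list from the engine config):

```python
import bisect

# Illustrative CUDA-graph capture sizes: 1, 2, 4, ..., 512 (per the text above)
CAPTURE_SIZES = [2 ** i for i in range(10)]

def padded_batch_size(num_seqs: int) -> int:
    """Round a batch up to the nearest captured size.

    Batches larger than the biggest capture size are returned unchanged here
    (in practice such batches run without a captured graph)."""
    i = bisect.bisect_left(CAPTURE_SIZES, num_seqs)
    return CAPTURE_SIZES[i] if i < len(CAPTURE_SIZES) else num_seqs
```

For example, a batch of 3 requests runs on the size-4 graph, so batch sizes just above a capture point waste the most padding.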
| Variable | Default | Purpose |
|---|---|---|
| PATH | System | Must include /usr/local/cuda/bin for FlashInfer JIT |
| LD_LIBRARY_PATH | System | Must include the NVIDIA pip package lib dirs |
| VLLM_USE_DEEP_GEMM | 1 | Enable/disable DeepGEMM FP8 kernels |
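Putting the table together, a launcher script might prepare the environment before invoking `vllm serve` like this (a sketch; the CUDA path is the one used throughout this page, so adjust it for your install):

```python
import os

def prepare_vllm_env(cuda_bin: str = "/usr/local/cuda/bin", use_deep_gemm: bool = True) -> dict:
    """Build an environment for `vllm serve`: nvcc on PATH plus the DeepGEMM toggle."""
    env = dict(os.environ)
    path = env.get("PATH", "")
    if cuda_bin not in path.split(os.pathsep):
        env["PATH"] = cuda_bin + os.pathsep + path
    env["VLLM_USE_DEEP_GEMM"] = "1" if use_deep_gemm else "0"
    return env
```

Pass the result to subprocess.Popen(..., env=...) so the server process inherits a correctly configured environment.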