Kernel Backends: FlashInfer and DeepGEMM

Updated on 11 March, 2026

vLLM on the NVIDIA HGX B200 uses specialized CUDA kernel backends for attention and GEMM operations. Understanding these backends helps diagnose startup issues, explain warmup times, and tune performance.


Overview

Backend      Purpose                      Used By                 JIT Compiled
FlashInfer   Attention (MHA, GQA, MLA)    All models              Yes (requires nvcc)
DeepGEMM     FP8 GEMM kernels             GLM-5, DeepSeek V3.2    Yes (~5 min first launch)
Fused MoE    Expert routing + GEMM        All MoE models          No (pre-compiled)
Triton       Custom GPU kernels           Various ops             Yes

FlashInfer

FlashInfer is vLLM's primary attention backend on the NVIDIA HGX B200. It JIT-compiles CUDA kernels for the specific GPU architecture (sm_100 / Blackwell).

Requirements

FlashInfer requires nvcc on PATH for JIT compilation:

console
$ export PATH="/usr/local/cuda/bin:$PATH"

Without nvcc on PATH, vLLM falls back to a slower attention implementation. As a quick sanity check, confirm the GPU meets FlashInfer's minimum compute capability (this verifies a prerequisite, sm_80 or newer, rather than probing FlashInfer directly):

console
$ python3 -c "from vllm.platforms import current_platform; print('sm_80+ capability:', current_platform.has_device_capability(80))"
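A preflight check for the nvcc requirement can also be scripted before launching vLLM; a minimal sketch (`require_nvcc` is a hypothetical helper, not part of vLLM):

```python
import os
import shutil

def require_nvcc() -> str:
    """Return the path to nvcc, or raise with a hint about PATH."""
    path = shutil.which("nvcc")
    if path is None:
        raise RuntimeError(
            'nvcc not found on PATH; try: export PATH="/usr/local/cuda/bin:$PATH"'
        )
    return path

# Call this before starting vLLM so JIT compilation can succeed:
# print(require_nvcc())
```

Failing fast here is cheaper than discovering the silent fallback to the slower attention path after startup.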

How It Works

  1. On first request, FlashInfer compiles attention kernels for the specific batch size and sequence length
  2. Compiled kernels are cached in ~/.cache/vllm/torch_compile_cache/
  3. Subsequent launches with similar configurations reuse cached kernels
  4. CUDA graphs capture the compiled kernels for zero-overhead dispatch
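The compile-once-and-reuse behavior in steps 1-3 can be modeled with a memoized compile function; a simplified sketch (not vLLM's actual cache-key scheme, which also persists kernels to disk):

```python
import functools

# Simplified model of JIT kernel caching: compile once per
# (arch, head_dim, page_size) configuration, then reuse.
@functools.lru_cache(maxsize=None)
def get_attention_kernel(arch: str, head_dim: int, page_size: int) -> str:
    # Stand-in for an expensive nvcc compilation step.
    return f"kernel_{arch}_hd{head_dim}_ps{page_size}"

k1 = get_attention_kernel("sm_100", 128, 16)  # compiles (cache miss)
k2 = get_attention_kernel("sm_100", 128, 16)  # cache hit, no recompile
```

This is why the first request with a new configuration is slow while later ones are not: the cost is paid once per distinct key.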

Configuration

FlashInfer is used automatically when available. Key behaviors:

Attention Type                   FlashInfer Support   Notes
Standard MHA/GQA                 Full                 Default backend
MLA (DeepSeek)                   Full                 Via vllm::unified_mla_attention op
Mamba/SSM (Nemotron)             N/A                  SSM layers bypass attention entirely
Lightning Attention (MiniMax)    Partial              Custom ops for linear attention component
DSA (GLM-5)                      Full                 Sparse attention pattern

Troubleshooting FlashInfer

nvcc not found:

WARNING: FlashInfer not available, falling back to eager attention

Fix: export PATH="/usr/local/cuda/bin:$PATH"

Compilation cache stale after vLLM upgrade:

console
$ rm -rf ~/.cache/vllm/torch_compile_cache/

The cache rebuilds automatically on next launch.

DeepGEMM

DeepGEMM provides JIT-compiled FP8 GEMM kernels using CUTLASS templates. It's used automatically for FP8 models that benefit from specialized matrix multiply operations.

Which Models Use DeepGEMM

Model                 DeepGEMM Kernels        First-Launch Warmup
Nemotron Nano 30B     No                      Minimal (~10s)
MiniMax M2.5          No                      Minimal (~10s)
GLM-5 744B            Yes (~2,259 kernels)    ~5 minutes
DeepSeek V3.2 685B    Yes (~1,827 kernels)    ~5 minutes

First-Launch Behavior

On the first launch of a DeepGEMM-using model, you'll see repeated compilation messages:

Compiling DeepGEMM kernel 1/2259...
Compiling DeepGEMM kernel 2/2259...
...

This takes ~5 minutes for GLM-5 and DeepSeek V3.2. Compiled kernels are cached and reused on subsequent launches.

Cache Location

DeepGEMM kernels are cached per-model:

console
$ ls ~/.cache/deep_gemm/
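To see how much disk the kernel cache occupies, a small sketch using the path above:

```python
from pathlib import Path

def cache_size_bytes(cache_dir: Path) -> int:
    """Total size of all files under a kernel cache directory."""
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())

# Example, using the DeepGEMM cache path from the docs above:
# print(cache_size_bytes(Path.home() / ".cache" / "deep_gemm"))
```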

CUTLASS Headers

DeepGEMM requires CUTLASS headers for compilation. If you installed DeepGEMM from source, you may need to symlink them into its include directory:

console
$ ln -sf /path/to/deep_gemm/third-party/cutlass/include/cutlass /path/to/deep_gemm/deep_gemm/include/
$ ln -sf /path/to/deep_gemm/third-party/cutlass/include/cute /path/to/deep_gemm/deep_gemm/include/

The pip-installed version handles this automatically.

Disabling DeepGEMM

If DeepGEMM causes issues:

console
$ export VLLM_USE_DEEP_GEMM=0

This falls back to standard FP8 GEMM kernels. Performance may decrease for models that benefit from DeepGEMM's optimized paths.
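The flag follows the usual boolean-environment-variable convention; a sketch of how such a kill switch is typically read (vLLM's own parsing may differ):

```python
import os

def deep_gemm_enabled() -> bool:
    """DeepGEMM is on by default (see the env var table below);
    export VLLM_USE_DEEP_GEMM=0 disables it. A sketch, not vLLM's
    exact parsing logic."""
    return os.environ.get("VLLM_USE_DEEP_GEMM", "1") != "0"
```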

Fused MoE Kernels

vLLM uses fused Mixture-of-Experts kernels that combine expert routing and GEMM into a single operation.

NVIDIA HGX B200 Status

No pre-tuned FP8 MoE kernel configuration exists for the NVIDIA HGX B200 (sm_100) yet. vLLM logs this warning:

Using default MoE config. Performance might be sub-optimal!

This is safe to ignore: performance is still strong with the default config. Tuned configs for Blackwell are expected in future vLLM releases.

Known Incompatibility

vLLM 0.16.0's fused MoE kernel assumes DeepSeek V3-style grouped routing (n_group > 0). Models with n_group = 0 (including MiniMax M2.5) crash with:

RuntimeError: Check failed: (args->n_group != 0) is false: n_group should not be zero for DeepSeekV3 routing

Use vLLM 0.12.0 for affected models. See Troubleshooting.
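To see why n_group = 0 breaks grouped routing, here is a simplified pure-Python sketch of DeepSeek V3-style group-limited top-k selection (illustrative only, not the fused kernel's implementation):

```python
def grouped_topk(scores, n_group, topk_group, topk):
    """Group-limited expert routing, DeepSeek V3 style (simplified).

    Experts are split into n_group groups; only the topk_group
    best-scoring groups stay eligible, then topk experts are picked
    from those. n_group == 0 makes group_size a division by zero,
    mirroring the kernel's n_group != 0 check above.
    """
    n_experts = len(scores)
    group_size = n_experts // n_group  # fails when n_group == 0
    # Score each group by its best expert.
    group_best = [
        max(scores[g * group_size:(g + 1) * group_size])
        for g in range(n_group)
    ]
    keep = sorted(range(n_group), key=lambda g: group_best[g], reverse=True)[:topk_group]
    eligible = [
        i for g in keep for i in range(g * group_size, (g + 1) * group_size)
    ]
    return sorted(eligible, key=lambda i: scores[i], reverse=True)[:topk]
```

Models that route without expert groups (n_group = 0) never define a group size, which is exactly the case the vLLM 0.16.0 kernel rejects.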

torch.compile and CUDA Graphs

vLLM 0.16.0 uses torch.compile with CUDA graphs for optimized execution:

  1. torch.compile: Fuses operations and optimizes the compute graph at startup
  2. CUDA graphs: Captures a sequence of GPU operations and replays them with near-zero CPU overhead

CUDA Graph Capture Sizes

vLLM pre-captures CUDA graphs for common batch sizes (1, 2, 4, 8, 16, ... up to 512). Requests are padded to the nearest capture size. This means:

  • First requests at each batch size trigger a CUDA graph capture (~1-2s delay)
  • Subsequent requests at that size execute with minimal overhead
  • The --max-num-seqs flag limits the maximum capture size
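The padding behavior can be sketched as follows, assuming the doubling sequence of capture sizes quoted above (vLLM's actual size list may differ):

```python
import bisect

# CUDA graph capture sizes as described in the text: 1, 2, 4, ... 512.
CAPTURE_SIZES = [2 ** i for i in range(10)]  # [1, 2, 4, ..., 512]

def pad_to_capture_size(batch_size: int) -> int:
    """Round a batch size up to the nearest pre-captured graph size."""
    i = bisect.bisect_left(CAPTURE_SIZES, batch_size)
    if i == len(CAPTURE_SIZES):
        raise ValueError(f"batch size {batch_size} exceeds max capture size")
    return CAPTURE_SIZES[i]
```

A batch of 3 requests therefore executes as a padded batch of 4, reusing the graph captured for that size.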

Disabling Compilation

For debugging or if torch.compile causes issues:

console
$ vllm serve <model> --enforce-eager

This disables both torch.compile and CUDA graphs. Expect 10-30% lower throughput but faster startup.

Environment Variable Summary

Variable              Default    Purpose
PATH                  System     Must include /usr/local/cuda/bin for FlashInfer JIT
LD_LIBRARY_PATH       System     Must include nvidia pip package lib dirs
VLLM_USE_DEEP_GEMM    1          Enable/disable DeepGEMM FP8 kernels
