Kernel Backends: FlashInfer and DeepGEMM

Updated on 11 March, 2026

vLLM on the NVIDIA HGX B200 uses specialized CUDA kernel backends for attention and GEMM operations. Understanding these backends helps diagnose startup issues, explain warmup times, and tune performance.


Overview

Backend      Purpose                      Used By                 JIT Compiled
FlashInfer   Attention (MHA, GQA, MLA)    All models              Yes (requires nvcc)
DeepGEMM     FP8 GEMM kernels             GLM-5, DeepSeek V3.2    Yes (~5 min first launch)
Fused MoE    Expert routing + GEMM        All MoE models          No (pre-compiled)
Triton       Custom GPU kernels           Various ops             Yes

FlashInfer

FlashInfer is vLLM's primary attention backend on the NVIDIA HGX B200. It JIT-compiles CUDA kernels for the specific GPU architecture (sm_100 / Blackwell).

Requirements

FlashInfer requires nvcc on PATH for JIT compilation:

console
$ export PATH="/usr/local/cuda/bin:$PATH"

Without nvcc on PATH, vLLM falls back to a slower attention implementation. As a quick sanity check, confirm the GPU meets FlashInfer's minimum compute capability (this verifies a prerequisite, sm_80 or newer, rather than probing FlashInfer directly):

console
$ python3 -c "from vllm.platforms import current_platform; print('sm_80+ capability:', current_platform.has_device_capability(80))"
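A preflight check for the nvcc requirement can also be scripted before launching vLLM; a minimal sketch (`require_nvcc` is a hypothetical helper, not part of vLLM):

```python
import os
import shutil

def require_nvcc() -> str:
    """Return the path to nvcc, or raise with a hint about PATH."""
    path = shutil.which("nvcc")
    if path is None:
        raise RuntimeError(
            'nvcc not found on PATH; try: export PATH="/usr/local/cuda/bin:$PATH"'
        )
    return path

# Call this before starting vLLM so JIT compilation can succeed:
# print(require_nvcc())
```

Failing fast here is cheaper than discovering the silent fallback to the slower attention path after startup.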

How It Works

  1. On first request, FlashInfer compiles attention kernels for the specific batch size and sequence length
  2. Compiled kernels are cached in ~/.cache/vllm/torch_compile_cache/
  3. Subsequent launches with similar configurations reuse cached kernels
  4. CUDA graphs capture the compiled kernels for zero-overhead dispatch
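The compile-once-and-reuse behavior in steps 1-3 can be modeled with a memoized compile function; a simplified sketch (not vLLM's actual cache-key scheme, which also persists kernels to disk):

```python
import functools

# Simplified model of JIT kernel caching: compile once per
# (arch, head_dim, page_size) configuration, then reuse.
@functools.lru_cache(maxsize=None)
def get_attention_kernel(arch: str, head_dim: int, page_size: int) -> str:
    # Stand-in for an expensive nvcc compilation step.
    return f"kernel_{arch}_hd{head_dim}_ps{page_size}"

k1 = get_attention_kernel("sm_100", 128, 16)  # compiles (cache miss)
k2 = get_attention_kernel("sm_100", 128, 16)  # cache hit, no recompile
```

This is why the first request with a new configuration is slow while later ones are not: the cost is paid once per distinct key.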

Configuration

FlashInfer is used automatically when available. Key behaviors:

Attention Type                   FlashInfer Support   Notes
Standard MHA/GQA                 Full                 Default backend
MLA (DeepSeek)                   Full                 Via vllm::unified_mla_attention op
Mamba/SSM (Nemotron)             N/A                  SSM layers bypass attention entirely
Lightning Attention (MiniMax)    Partial              Custom ops for linear attention component
DSA (GLM-5)                      Full                 Sparse attention pattern

Troubleshooting FlashInfer

nvcc not found:

WARNING: FlashInfer not available, falling back to eager attention

Fix: export PATH="/usr/local/cuda/bin:$PATH"

Compilation cache stale after vLLM upgrade:

console
$ rm -rf ~/.cache/vllm/torch_compile_cache/

The cache rebuilds automatically on next launch.

DeepGEMM

DeepGEMM provides JIT-compiled FP8 GEMM kernels using CUTLASS templates. It's used automatically for FP8 models that benefit from specialized matrix multiply operations.

Which Models Use DeepGEMM

Model                 DeepGEMM Kernels        First-Launch Warmup
Nemotron Nano 30B     No                      Minimal (~10s)
MiniMax M2.5          No                      Minimal (~10s)
GLM-5 744B            Yes (~2,259 kernels)    ~5 minutes
DeepSeek V3.2 685B    Yes (~1,827 kernels)    ~5 minutes

First-Launch Behavior

On the first launch of a DeepGEMM-using model, you'll see repeated compilation messages:

Compiling DeepGEMM kernel 1/2259...
Compiling DeepGEMM kernel 2/2259...
...

This takes ~5 minutes for GLM-5 and DeepSeek V3.2. Compiled kernels are cached and reused on subsequent launches.

Cache Location

DeepGEMM kernels are cached per-model:

console
$ ls ~/.cache/deep_gemm/
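To see how much disk the kernel cache occupies, a small sketch using the path above:

```python
from pathlib import Path

def cache_size_bytes(cache_dir: Path) -> int:
    """Total size of all files under a kernel cache directory."""
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())

# Example, using the DeepGEMM cache path from the docs above:
# print(cache_size_bytes(Path.home() / ".cache" / "deep_gemm"))
```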

CUTLASS Headers

DeepGEMM requires CUTLASS headers for compilation. If you installed DeepGEMM from source, you may need to symlink them into its include directory:

console
$ ln -sf /path/to/deep_gemm/third-party/cutlass/include/cutlass /path/to/deep_gemm/deep_gemm/include/
$ ln -sf /path/to/deep_gemm/third-party/cutlass/include/cute /path/to/deep_gemm/deep_gemm/include/

The pip-installed version handles this automatically.

Disabling DeepGEMM

If DeepGEMM causes issues:

console
$ export VLLM_USE_DEEP_GEMM=0

This falls back to standard FP8 GEMM kernels. Performance may decrease for models that benefit from DeepGEMM's optimized paths.
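The flag follows the usual boolean-environment-variable convention; a sketch of how such a kill switch is typically read (vLLM's own parsing may differ):

```python
import os

def deep_gemm_enabled() -> bool:
    """DeepGEMM is on by default (see the env var table below);
    export VLLM_USE_DEEP_GEMM=0 disables it. A sketch, not vLLM's
    exact parsing logic."""
    return os.environ.get("VLLM_USE_DEEP_GEMM", "1") != "0"
```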

Fused MoE Kernels

vLLM uses fused Mixture-of-Experts kernels that combine expert routing and GEMM into a single operation.

NVIDIA HGX B200 Status

No pre-tuned FP8 MoE kernel configuration exists for the NVIDIA HGX B200 (sm_100) yet. vLLM logs this warning:

Using default MoE config. Performance might be sub-optimal!

This is safe to ignore: performance is still strong with the default config. Tuned configs for Blackwell are expected in future vLLM releases.

Known Incompatibility

vLLM 0.16.0's fused MoE kernel assumes DeepSeek V3-style grouped routing (n_group > 0). Models with n_group = 0 (including MiniMax M2.5) crash with:

RuntimeError: Check failed: (args->n_group != 0) is false: n_group should not be zero for DeepSeekV3 routing

Use vLLM 0.12.0 for affected models. See Troubleshooting.
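To see why n_group = 0 breaks grouped routing, here is a simplified pure-Python sketch of DeepSeek V3-style group-limited top-k selection (illustrative only, not the fused kernel's implementation):

```python
def grouped_topk(scores, n_group, topk_group, topk):
    """Group-limited expert routing, DeepSeek V3 style (simplified).

    Experts are split into n_group groups; only the topk_group
    best-scoring groups stay eligible, then topk experts are picked
    from those. n_group == 0 makes group_size a division by zero,
    mirroring the kernel's n_group != 0 check above.
    """
    n_experts = len(scores)
    group_size = n_experts // n_group  # fails when n_group == 0
    # Score each group by its best expert.
    group_best = [
        max(scores[g * group_size:(g + 1) * group_size])
        for g in range(n_group)
    ]
    keep = sorted(range(n_group), key=lambda g: group_best[g], reverse=True)[:topk_group]
    eligible = [
        i for g in keep for i in range(g * group_size, (g + 1) * group_size)
    ]
    return sorted(eligible, key=lambda i: scores[i], reverse=True)[:topk]
```

Models that route without expert groups (n_group = 0) never define a group size, which is exactly the case the vLLM 0.16.0 kernel rejects.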

torch.compile and CUDA Graphs

vLLM 0.16.0 uses torch.compile with CUDA graphs for optimized execution:

  1. torch.compile: Fuses operations and optimizes the compute graph at startup
  2. CUDA graphs: Captures a sequence of GPU operations and replays them with near-zero CPU overhead

CUDA Graph Capture Sizes

vLLM pre-captures CUDA graphs for common batch sizes (1, 2, 4, 8, 16, ... up to 512). Requests are padded to the nearest capture size. This means:

  • First requests at each batch size trigger a CUDA graph capture (~1-2s delay)
  • Subsequent requests at that size execute with minimal overhead
  • The --max-num-seqs flag limits the maximum capture size
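The padding behavior can be sketched as follows, assuming the doubling sequence of capture sizes quoted above (vLLM's actual size list may differ):

```python
import bisect

# CUDA graph capture sizes as described in the text: 1, 2, 4, ... 512.
CAPTURE_SIZES = [2 ** i for i in range(10)]  # [1, 2, 4, ..., 512]

def pad_to_capture_size(batch_size: int) -> int:
    """Round a batch size up to the nearest pre-captured graph size."""
    i = bisect.bisect_left(CAPTURE_SIZES, batch_size)
    if i == len(CAPTURE_SIZES):
        raise ValueError(f"batch size {batch_size} exceeds max capture size")
    return CAPTURE_SIZES[i]
```

A batch of 3 requests therefore executes as a padded batch of 4, reusing the graph captured for that size.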

Disabling Compilation

For debugging or if torch.compile causes issues:

console
$ vllm serve <model> --enforce-eager

This disables both torch.compile and CUDA graphs. Expect 10-30% lower throughput but faster startup.

Environment Variable Summary

Variable              Default    Purpose
PATH                  System     Must include /usr/local/cuda/bin for FlashInfer JIT
LD_LIBRARY_PATH       System     Must include nvidia pip package lib dirs
VLLM_USE_DEEP_GEMM    1          Enable/disable DeepGEMM FP8 kernels
