Extend effective memory by offloading KV cache to CPU memory when GPU HBM is insufficient.
KV cache offloading stores key-value cache data in CPU memory when GPU HBM is exhausted, extending effective capacity for longer contexts and higher concurrency. Whether it is needed depends on the model and context length:
| Scenario | KV Offloading Needed? | Why / Alternative |
|---|---|---|
| Nemotron Nano 30B, 32K context | No | Mamba hybrid has minimal KV cache |
| MiniMax M2.5, 32K context | No | 132 GB/GPU KV budget is ample |
| DeepSeek V3.2, 32K context | No | MLA compression keeps cache small |
| GLM-5, 128K+ context | Maybe | Reduce --max-model-len first |
| Any model, 1M context | Yes | No alternative at this scale |
$ vllm serve <model> \
--tensor-parallel-size <TP> \
--cpu-offload-gb 32 \
--trust-remote-code
The --cpu-offload-gb flag allocates a CPU memory buffer for KV cache overflow. When GPU KV cache is full, least-recently-used blocks are moved to CPU memory.
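The least-recently-used behavior described above can be sketched as a toy policy. This is illustrative only; vLLM's block manager is far more involved, and the class and method names here are invented for the sketch:

```python
from collections import OrderedDict

class ToyKVOffloader:
    """Toy LRU policy: when the GPU block pool is full, spill the
    least-recently-used block to a CPU buffer. Not vLLM's actual code."""

    def __init__(self, gpu_capacity_blocks: int):
        self.gpu = OrderedDict()   # block_id -> data, ordered by recency
        self.cpu = {}              # overflow buffer in host memory
        self.capacity = gpu_capacity_blocks

    def touch(self, block_id, data=None):
        """Access (or insert) a block, promoting it to most-recently-used."""
        if block_id in self.cpu:               # fault it back from CPU
            data = self.cpu.pop(block_id)
        elif block_id in self.gpu:
            data = self.gpu.pop(block_id)
        while len(self.gpu) >= self.capacity:  # spill LRU blocks to CPU
            lru_id, lru_data = self.gpu.popitem(last=False)
            self.cpu[lru_id] = lru_data
        self.gpu[block_id] = data

cache = ToyKVOffloader(gpu_capacity_blocks=2)
cache.touch("a", b"kv-a")
cache.touch("b", b"kv-b")
cache.touch("c", b"kv-c")   # GPU pool full: block "a" is spilled to CPU
```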
| Flag | Purpose | Example |
|---|---|---|
| --cpu-offload-gb | CPU buffer size in GiB | 32 |
| --kv-cache-dtype fp8 | Halves KV cache size (GPU + CPU) | Reduces the need for offloading |
| --gpu-memory-utilization | GPU VRAM fraction for model + cache | 0.95 for maximum GPU cache |
KV cache offloading works differently depending on the attention architecture:
| Model | Attention | Offloading Support | Notes |
|---|---|---|---|
| Nemotron Nano 30B | Mamba + Transformer | Supported (Transformer layers only) | Mamba layers have no KV cache to offload |
| MiniMax M2.5 | Lightning Attention | Supported (SoftMax layers only) | Linear attention layers use fixed-size state |
| GLM-5 744B | DSA | Supported | Standard KV format for attention layers |
| DeepSeek V3.2 685B | MLA | Limited | MLA's compressed latent format may not be compatible with all offloading implementations |
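The size differences in the table can be made concrete with back-of-envelope arithmetic. The dimensions below are illustrative assumptions, not the real hyperparameters of any model above:

```python
def kv_bytes_per_token(num_layers, kv_heads, head_dim, dtype_bytes,
                       latent_dim=None):
    """Per-token KV cache size. With MLA, only a compressed latent vector
    of size latent_dim is cached per layer instead of full K and V."""
    if latent_dim is not None:          # MLA: one compressed vector per layer
        return num_layers * latent_dim * dtype_bytes
    return num_layers * 2 * kv_heads * head_dim * dtype_bytes  # K and V

# Illustrative configs (assumptions, not actual model specs)
gqa = kv_bytes_per_token(num_layers=48, kv_heads=8, head_dim=128, dtype_bytes=2)
mla = kv_bytes_per_token(num_layers=61, kv_heads=None, head_dim=None,
                         dtype_bytes=2, latent_dim=576)
print(f"GQA: {gqa} B/token, MLA: {mla} B/token")
```

Even with more layers, the MLA-style cache comes out several times smaller per token, which is why offloading is rarely needed for such models at moderate context lengths.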
| Context Length | Concurrent Requests | Recommended --cpu-offload-gb |
|---|---|---|
| 32K | 100-500 | 16-32 GiB |
| 64K | 100-500 | 32-64 GiB |
| 128K | 50-200 | 64-128 GiB |
| 256K+ | Any | 128+ GiB |
The NVIDIA HGX B200 node has ample system RAM (typically 1-2 TB), so large CPU buffers are feasible.
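One way to sanity-check the sizing table is to estimate total KV demand and subtract the GPU-resident share. The per-token cache size and GPU KV pool below are assumed values for illustration:

```python
def recommended_offload_gib(context_len, concurrency, kv_bytes_per_token,
                            gpu_kv_gib):
    """Estimate CPU offload buffer: total KV demand minus what fits on GPU.
    kv_bytes_per_token is model-specific; the value used below is assumed."""
    total_gib = context_len * concurrency * kv_bytes_per_token / 2**30
    return max(0.0, total_gib - gpu_kv_gib)

# 128K context, 100 concurrent requests, ~10 KiB/token KV, 64 GiB GPU KV pool
need = recommended_offload_gib(128 * 1024, 100, 10 * 1024, 64)
print(f"suggested --cpu-offload-gb ~= {need:.0f}")
```

Under these assumptions the estimate lands inside the 64-128 GiB band the table gives for 128K contexts.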
CPU memory bandwidth (~200 GB/s for DDR5) is roughly 40x lower than NVIDIA HGX B200 HBM bandwidth (8.0 TB/s). Offloading therefore adds latency proportional to the number of cache blocks transferred:
| Metric | Without Offloading | With Offloading |
|---|---|---|
| TTFT | Baseline | +10-50% (context-dependent) |
| TPOT | Baseline | +5-20% (if active tokens hit CPU cache) |
| Max context | Limited by GPU VRAM | Extended by CPU buffer |
| Max concurrent | Limited by GPU VRAM | Extended by CPU buffer |
The performance impact depends on how often the active working set exceeds GPU cache. If most requests fit in GPU VRAM and only overflow requests hit CPU, the average impact is small.
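The latency cost of an overflow can be bounded from the bandwidth gap. The block size and count below are illustrative assumptions, not vLLM's actual block layout:

```python
def transfer_ms(num_blocks, block_bytes, bandwidth_gbs):
    """Time to move KV blocks across a link at the given bandwidth (GB/s)."""
    return num_blocks * block_bytes / (bandwidth_gbs * 1e9) * 1e3

# Restoring 1 GiB of evicted KV cache (assumed 4096 blocks of 256 KiB)
blocks, block_bytes = 4096, 256 * 1024
cpu_ms = transfer_ms(blocks, block_bytes, 200)    # ~DDR5 CPU path
hbm_ms = transfer_ms(blocks, block_bytes, 8000)   # B200 HBM, for comparison
print(f"CPU path: {cpu_ms:.1f} ms vs HBM: {hbm_ms:.2f} ms")
```

A few milliseconds per gibibyte restored is small next to typical TTFT for long prompts, which is why the average impact stays low when only overflow requests touch the CPU tier.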
Before enabling KV cache offloading, try these alternatives; they maintain full GPU-speed operation.
Using a quantized checkpoint (here NVFP4) halves model weight memory, freeing VRAM for KV cache:
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--tensor-parallel-size 1 \
--trust-remote-code
Setting the KV cache dtype to FP8 halves per-token KV cache size, independently of model weight quantization:
$ vllm serve <model> --kv-cache-dtype fp8
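The halving is straightforward arithmetic. The model dimensions here are assumptions chosen only to make the numbers concrete:

```python
def kv_gib(context_len, layers, kv_heads, head_dim, dtype_bytes):
    """KV cache for one request: two tensors (K and V) per layer."""
    return context_len * layers * 2 * kv_heads * head_dim * dtype_bytes / 2**30

# Illustrative 32K-context request (dimensions are assumptions)
fp16 = kv_gib(32768, 48, 8, 128, 2)   # 2-byte cache entries
fp8  = kv_gib(32768, 48, 8, 128, 1)   # 1-byte cache entries
print(f"fp16: {fp16:.1f} GiB -> fp8: {fp8:.1f} GiB per request")
```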
--max-model-len 16384 # Instead of 32768; halves per-request KV cache
--gpu-memory-utilization 0.95 # Instead of 0.90
Raising utilization to 0.95 adds ~9 GB of KV capacity per GPU, but leaves less VRAM headroom; use with caution under peak load.
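The ~9 GB figure follows directly from the utilization delta on a 180 GB B200 GPU:

```python
hbm_gb = 180                      # HBM3e capacity per B200 GPU
extra = (0.95 - 0.90) * hbm_gb    # VRAM newly handed to the KV cache pool
print(f"extra KV capacity: {extra:.0f} GB per GPU")
```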
Higher tensor parallelism distributes model weights across more GPUs, freeing VRAM per GPU for KV cache:
--tensor-parallel-size 4 # Instead of 2
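The VRAM freed by raising TP can be estimated by dividing weight memory across more GPUs. The 120 GB weight footprint below is an assumed figure for illustration:

```python
def free_vram_per_gpu(weight_gb, hbm_gb, tp):
    """VRAM left for KV cache per GPU once weights are sharded TP ways."""
    return hbm_gb - weight_gb / tp

# Assumed 120 GB of weights on 180 GB B200 GPUs
for tp in (2, 4):
    print(f"TP={tp}: {free_vram_per_gpu(120, 180, tp):.0f} GB free per GPU")
```

The trade-off is that the extra GPUs are then unavailable for serving other replicas, so this is worthwhile only when KV capacity, not throughput, is the bottleneck.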
NVIDIA Dynamo extends KV cache offloading with a multi-tier hierarchy spanning GPU HBM, CPU memory, local SSD, and remote storage.
Dynamo's KV-aware router intelligently places requests on GPUs that already have relevant cache, minimizing transfers.
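KV-aware placement can be sketched as choosing the worker whose cache shares the longest block-hash prefix with the incoming request. This is a toy heuristic to illustrate the idea, not Dynamo's actual routing algorithm:

```python
def route(prompt_blocks, workers):
    """Pick the worker whose cached block hashes share the longest prefix
    with the request (toy version of KV-aware routing)."""
    def overlap(cached):
        n = 0
        for a, b in zip(prompt_blocks, cached):
            if a != b:
                break
            n += 1
        return n
    return max(workers, key=lambda w: overlap(workers[w]))

# Hypothetical workers with block-hash lists of what each GPU already caches
workers = {"gpu0": ["sys", "docA"], "gpu1": ["sys", "docB", "q1"]}
print(route(["sys", "docB", "q2"], workers))  # gpu1 holds the longer prefix
```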
See Dynamo Overview for details. Dynamo 0.9.1 was successfully started on this node with vLLM 0.16.0, though tiered caching requires the full Dynamo stack (etcd, NATS) which was not deployed.