Extend effective memory by offloading KV cache to CPU memory when GPU HBM is insufficient.
KV cache offloading stores key-value cache data in CPU memory when GPU HBM is exhausted, extending effective capacity for longer contexts and higher concurrency. Whether it is needed depends on the model and context length:
| Scenario | KV Offloading Needed? | Why / Alternative |
|---|---|---|
| Nemotron Nano 30B, 32K context | No | Mamba hybrid has minimal KV cache |
| MiniMax M2.5, 32K context | No | 132 GB/GPU KV budget is ample |
| DeepSeek V3.2, 32K context | No | MLA compression keeps cache small |
| GLM-5, 128K+ context | Maybe | Reduce --max-model-len first |
| Any model, 1M context | Yes | No alternative at this scale |
$ vllm serve <model> \
--tensor-parallel-size <TP> \
--cpu-offload-gb 32 \
--trust-remote-code
The --cpu-offload-gb flag allocates a CPU memory buffer for KV cache overflow. When GPU KV cache is full, least-recently-used blocks are moved to CPU memory.
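The least-recently-used behavior described above can be sketched as a toy policy. This is illustrative only; vLLM's block manager is far more involved, and the class and method names here are invented for the sketch:

```python
from collections import OrderedDict

class ToyKVOffloader:
    """Toy LRU policy: when the GPU block pool is full, spill the
    least-recently-used block to a CPU buffer. Not vLLM's actual code."""

    def __init__(self, gpu_capacity_blocks: int):
        self.gpu = OrderedDict()   # block_id -> data, ordered by recency
        self.cpu = {}              # overflow buffer in host memory
        self.capacity = gpu_capacity_blocks

    def touch(self, block_id, data=None):
        """Access (or insert) a block, promoting it to most-recently-used."""
        if block_id in self.cpu:               # fault it back from CPU
            data = self.cpu.pop(block_id)
        elif block_id in self.gpu:
            data = self.gpu.pop(block_id)
        while len(self.gpu) >= self.capacity:  # spill LRU blocks to CPU
            lru_id, lru_data = self.gpu.popitem(last=False)
            self.cpu[lru_id] = lru_data
        self.gpu[block_id] = data

cache = ToyKVOffloader(gpu_capacity_blocks=2)
cache.touch("a", b"kv-a")
cache.touch("b", b"kv-b")
cache.touch("c", b"kv-c")   # GPU pool full: block "a" is spilled to CPU
```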
| Flag | Purpose | Example |
|---|---|---|
| --cpu-offload-gb | CPU buffer size in GiB | 32 |
| --kv-cache-dtype fp8 | Halves KV cache size (GPU + CPU) | Reduces the need for offloading |
| --gpu-memory-utilization | GPU VRAM fraction for model + cache | 0.95 for maximum GPU cache |
KV cache offloading works differently depending on the attention architecture:
| Model | Attention | Offloading Support | Notes |
|---|---|---|---|
| Nemotron Nano 30B | Mamba + Transformer | Supported (Transformer layers only) | Mamba layers have no KV cache to offload |
| MiniMax M2.5 | Lightning Attention | Supported (SoftMax layers only) | Linear attention layers use fixed-size state |
| GLM-5 744B | DSA | Supported | Standard KV format for attention layers |
| DeepSeek V3.2 685B | MLA | Limited | MLA's compressed latent format may not be compatible with all offloading implementations |
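The size differences in the table can be made concrete with back-of-envelope arithmetic. The dimensions below are illustrative assumptions, not the real hyperparameters of any model above:

```python
def kv_bytes_per_token(num_layers, kv_heads, head_dim, dtype_bytes,
                       latent_dim=None):
    """Per-token KV cache size. With MLA, only a compressed latent vector
    of size latent_dim is cached per layer instead of full K and V."""
    if latent_dim is not None:          # MLA: one compressed vector per layer
        return num_layers * latent_dim * dtype_bytes
    return num_layers * 2 * kv_heads * head_dim * dtype_bytes  # K and V

# Illustrative configs (assumptions, not actual model specs)
gqa = kv_bytes_per_token(num_layers=48, kv_heads=8, head_dim=128, dtype_bytes=2)
mla = kv_bytes_per_token(num_layers=61, kv_heads=None, head_dim=None,
                         dtype_bytes=2, latent_dim=576)
print(f"GQA: {gqa} B/token, MLA: {mla} B/token")
```

Even with more layers, the MLA-style cache comes out several times smaller per token, which is why offloading is rarely needed for such models at moderate context lengths.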
| Context Length | Concurrent Requests | Recommended --cpu-offload-gb |
|---|---|---|
| 32K | 100-500 | 16-32 GiB |
| 64K | 100-500 | 32-64 GiB |
| 128K | 50-200 | 64-128 GiB |
| 256K+ | Any | 128+ GiB |
The NVIDIA HGX B200 node has ample system RAM (typically 1-2 TB), so large CPU buffers are feasible.
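One way to sanity-check the sizing table is to estimate total KV demand and subtract the GPU-resident share. The per-token cache size and GPU KV pool below are assumed values for illustration:

```python
def recommended_offload_gib(context_len, concurrency, kv_bytes_per_token,
                            gpu_kv_gib):
    """Estimate CPU offload buffer: total KV demand minus what fits on GPU.
    kv_bytes_per_token is model-specific; the value used below is assumed."""
    total_gib = context_len * concurrency * kv_bytes_per_token / 2**30
    return max(0.0, total_gib - gpu_kv_gib)

# 128K context, 100 concurrent requests, ~10 KiB/token KV, 64 GiB GPU KV pool
need = recommended_offload_gib(128 * 1024, 100, 10 * 1024, 64)
print(f"suggested --cpu-offload-gb ~= {need:.0f}")
```

Under these assumptions the estimate lands inside the 64-128 GiB band the table gives for 128K contexts.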
CPU memory bandwidth (~200 GB/s for DDR5) is roughly 40x lower than NVIDIA HGX B200 HBM bandwidth (8.0 TB/s). Offloading therefore adds latency proportional to the number of cache blocks transferred:
| Metric | Without Offloading | With Offloading |
|---|---|---|
| TTFT | Baseline | +10-50% (context-dependent) |
| TPOT | Baseline | +5-20% (if active tokens hit CPU cache) |
| Max context | Limited by GPU VRAM | Extended by CPU buffer |
| Max concurrent | Limited by GPU VRAM | Extended by CPU buffer |
The performance impact depends on how often the active working set exceeds GPU cache. If most requests fit in GPU VRAM and only overflow requests hit CPU, the average impact is small.
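The latency cost of an overflow can be bounded from the bandwidth gap. The block size and count below are illustrative assumptions, not vLLM's actual block layout:

```python
def transfer_ms(num_blocks, block_bytes, bandwidth_gbs):
    """Time to move KV blocks across a link at the given bandwidth (GB/s)."""
    return num_blocks * block_bytes / (bandwidth_gbs * 1e9) * 1e3

# Restoring 1 GiB of evicted KV cache (assumed 4096 blocks of 256 KiB)
blocks, block_bytes = 4096, 256 * 1024
cpu_ms = transfer_ms(blocks, block_bytes, 200)    # ~DDR5 CPU path
hbm_ms = transfer_ms(blocks, block_bytes, 8000)   # B200 HBM, for comparison
print(f"CPU path: {cpu_ms:.1f} ms vs HBM: {hbm_ms:.2f} ms")
```

A few milliseconds per gibibyte restored is small next to typical TTFT for long prompts, which is why the average impact stays low when only overflow requests touch the CPU tier.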
Before enabling KV cache offloading, try these alternatives; they maintain full GPU-speed operation.
Using a quantized checkpoint (here NVFP4) halves model weight memory, freeing VRAM for KV cache:
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--tensor-parallel-size 1 \
--trust-remote-code
Setting the KV cache dtype to FP8 halves per-token KV cache size, independently of model weight quantization:
$ vllm serve <model> --kv-cache-dtype fp8
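The halving is straightforward arithmetic. The model dimensions here are assumptions chosen only to make the numbers concrete:

```python
def kv_gib(context_len, layers, kv_heads, head_dim, dtype_bytes):
    """KV cache for one request: two tensors (K and V) per layer."""
    return context_len * layers * 2 * kv_heads * head_dim * dtype_bytes / 2**30

# Illustrative 32K-context request (dimensions are assumptions)
fp16 = kv_gib(32768, 48, 8, 128, 2)   # 2-byte cache entries
fp8  = kv_gib(32768, 48, 8, 128, 1)   # 1-byte cache entries
print(f"fp16: {fp16:.1f} GiB -> fp8: {fp8:.1f} GiB per request")
```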
--max-model-len 16384 # Instead of 32768; halves per-request KV cache
--gpu-memory-utilization 0.95 # Instead of 0.90
Raising utilization to 0.95 adds ~9 GB of KV capacity per GPU, but leaves less VRAM headroom; use with caution under peak load.
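The ~9 GB figure follows directly from the utilization delta on a 180 GB B200 GPU:

```python
hbm_gb = 180                      # HBM3e capacity per B200 GPU
extra = (0.95 - 0.90) * hbm_gb    # VRAM newly handed to the KV cache pool
print(f"extra KV capacity: {extra:.0f} GB per GPU")
```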
Higher tensor parallelism distributes model weights across more GPUs, freeing VRAM per GPU for KV cache:
--tensor-parallel-size 4 # Instead of 2
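The VRAM freed by raising TP can be estimated by dividing weight memory across more GPUs. The 120 GB weight footprint below is an assumed figure for illustration:

```python
def free_vram_per_gpu(weight_gb, hbm_gb, tp):
    """VRAM left for KV cache per GPU once weights are sharded TP ways."""
    return hbm_gb - weight_gb / tp

# Assumed 120 GB of weights on 180 GB B200 GPUs
for tp in (2, 4):
    print(f"TP={tp}: {free_vram_per_gpu(120, 180, tp):.0f} GB free per GPU")
```

The trade-off is that the extra GPUs are then unavailable for serving other replicas, so this is worthwhile only when KV capacity, not throughput, is the bottleneck.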
NVIDIA Dynamo extends KV cache offloading with a multi-tier hierarchy spanning GPU HBM, CPU memory, local SSD, and remote storage.
Dynamo's KV-aware router intelligently places requests on GPUs that already have relevant cache, minimizing transfers.
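KV-aware placement can be sketched as choosing the worker whose cache shares the longest block-hash prefix with the incoming request. This is a toy heuristic to illustrate the idea, not Dynamo's actual routing algorithm:

```python
def route(prompt_blocks, workers):
    """Pick the worker whose cached block hashes share the longest prefix
    with the request (toy version of KV-aware routing)."""
    def overlap(cached):
        n = 0
        for a, b in zip(prompt_blocks, cached):
            if a != b:
                break
            n += 1
        return n
    return max(workers, key=lambda w: overlap(workers[w]))

# Hypothetical workers with block-hash lists of what each GPU already caches
workers = {"gpu0": ["sys", "docA"], "gpu1": ["sys", "docB", "q1"]}
print(route(["sys", "docB", "q2"], workers))  # gpu1 holds the longer prefix
```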
See Dynamo Overview for details. Dynamo 0.9.1 was successfully started on this node with vLLM 0.16.0, though tiered caching requires the full Dynamo stack (etcd, NATS) which was not deployed.