KV Cache Offloading

Updated on 11 March, 2026

Extend effective memory by offloading KV cache to CPU memory when GPU HBM is insufficient.


Overview

KV cache offloading stores key-value cache data in CPU memory when GPU HBM is exhausted, enabling:

  • Longer context lengths beyond what fits in GPU VRAM
  • More concurrent requests at a given context length
  • Trading latency for capacity

Note
NVIDIA HGX B200 context: with 179 GB of HBM per GPU and 8.0 TB/s of bandwidth, the NVIDIA HGX B200 rarely needs KV offloading for typical workloads. Consider it only for extreme context lengths (128K+) or very high concurrency with large models.

When to Consider Offloading

| Scenario | KV Offloading Needed? | Better Alternative |
| --- | --- | --- |
| Nemotron Nano 30B, 32K context | No | Mamba hybrid has minimal KV cache |
| MiniMax M2.5, 32K context | No | 132 GB/GPU KV budget is ample |
| DeepSeek V3.2, 32K context | No | MLA compression keeps the cache small |
| GLM-5, 128K+ context | Maybe | Reduce --max-model-len first |
| Any model, 1M context | Yes | No alternative at this scale |
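To judge which row of the table a workload falls into, it helps to estimate the per-request KV cache footprint. The sketch below uses the standard formula for a GQA-style transformer; the layer count, KV head count, and head dimension are illustrative values, not the parameters of any model listed above:

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_len,
                 bytes_per_elem=2):
    """Rough per-request KV cache size in GiB: K and V, all layers, FP16 by default."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30

# Illustrative GQA config (hypothetical, not a specific model):
# 48 layers, 8 KV heads, head_dim 128, FP16 cache, 32K context.
per_request = kv_cache_gib(48, 8, 128, 32_768)
print(f"{per_request:.2f} GiB per 32K-token request")  # → 6.00 GiB per 32K-token request
```

Multiply the per-request figure by the expected concurrency and compare it against the per-GPU KV budget to decide whether offloading is even in play.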

Basic Usage

console
$ vllm serve <model> \
  --tensor-parallel-size <TP> \
  --cpu-offload-gb 32 \
  --trust-remote-code

The --cpu-offload-gb flag allocates a CPU memory buffer for cache overflow. When the GPU KV cache fills, the least-recently-used blocks are moved to CPU memory.
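The spill behavior described above can be sketched as a toy two-tier LRU cache. This is an illustration of the eviction policy only, not vLLM's actual block manager; class and method names are invented for the example:

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Toy two-tier cache: full GPU tier spills LRU blocks to a CPU tier."""
    def __init__(self, gpu_blocks, cpu_blocks):
        self.gpu = OrderedDict()   # block_id -> data, oldest first
        self.cpu = OrderedDict()
        self.gpu_cap, self.cpu_cap = gpu_blocks, cpu_blocks

    def access(self, block_id, data=None):
        if block_id in self.gpu:                 # GPU hit: refresh LRU position
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:                 # CPU hit: promote back to GPU
            data = self.cpu.pop(block_id)
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_cap:      # GPU full: spill LRU block to CPU
            evicted_id, evicted = self.gpu.popitem(last=False)
            self.cpu[evicted_id] = evicted
            while len(self.cpu) > self.cpu_cap:  # CPU full: drop the oldest block
                self.cpu.popitem(last=False)
        return data

cache = TwoTierKVCache(gpu_blocks=2, cpu_blocks=2)
for bid in ["a", "b", "c"]:
    cache.access(bid, data=f"kv-{bid}")
print(sorted(cache.gpu), sorted(cache.cpu))  # → ['b', 'c'] ['a']
```

Note how block "a", the least recently used, is the one pushed down to the CPU tier once the GPU tier overflows.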

Configuration Options

| Flag | Purpose | Example |
| --- | --- | --- |
| --cpu-offload-gb | CPU buffer size in GiB | 32 |
| --kv-cache-dtype | Halves KV cache size (GPU and CPU), reducing the need for offloading | fp8 |
| --gpu-memory-utilization | Fraction of GPU VRAM for model weights + cache | 0.95 for maximum GPU cache |

Model Compatibility

KV cache offloading works differently depending on the attention architecture:

| Model | Attention | Offloading Support | Notes |
| --- | --- | --- | --- |
| Nemotron Nano 30B | Mamba + Transformer | Supported (Transformer layers only) | Mamba layers have no KV cache to offload |
| MiniMax M2.5 | Lightning Attention | Supported (softmax layers only) | Linear attention layers use fixed-size state |
| GLM-5 744B | DSA | Supported | Standard KV format for attention layers |
| DeepSeek V3.2 685B | MLA | Limited | MLA's compressed latent format may not be compatible with all offloading implementations |

Sizing Guidelines

CPU Buffer Size

| Context Length | Concurrent Requests | Recommended --cpu-offload-gb |
| --- | --- | --- |
| 32K | 100-500 | 16-32 GiB |
| 64K | 100-500 | 32-64 GiB |
| 128K | 50-200 | 64-128 GiB |
| 256K+ | Any | 128+ GiB |

The NVIDIA HGX B200 node has ample system RAM (typically 1-2 TB), so large CPU buffers are feasible.
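The rows in the table reduce to simple arithmetic: size the CPU buffer to absorb whatever KV demand spills past the GPU budget. The per-request size and GPU budget below are hypothetical placeholders, not measured values:

```python
def cpu_overflow_gib(per_request_gib, concurrent, gpu_kv_budget_gib):
    """KV demand that spills past the GPU budget; size --cpu-offload-gb to cover it."""
    return max(0.0, per_request_gib * concurrent - gpu_kv_budget_gib)

# Hypothetical workload: 0.5 GiB of KV per 32K request, 300 concurrent
# requests, and a 120 GiB GPU KV budget per replica.
print(cpu_overflow_gib(0.5, 300, 120))  # → 30.0
```

A 30 GiB overflow lands inside the 16-32 GiB recommendation for the 32K row; if the result is 0.0, offloading is unnecessary.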

Memory Bandwidth Impact

CPU memory bandwidth (~200 GB/s for DDR5) is roughly 40x lower than NVIDIA HGX B200 HBM bandwidth (8.0 TB/s). Offloading therefore adds latency proportional to the volume of cache blocks transferred:

| Metric | Without Offloading | With Offloading |
| --- | --- | --- |
| TTFT (time to first token) | Baseline | +10-50% (context-dependent) |
| TPOT (time per output token) | Baseline | +5-20% (if active tokens hit CPU cache) |
| Max context | Limited by GPU VRAM | Extended by CPU buffer |
| Max concurrency | Limited by GPU VRAM | Extended by CPU buffer |

The performance impact depends on how often the active working set exceeds GPU cache. If most requests fit in GPU VRAM and only overflow requests hit CPU, the average impact is small.
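A back-of-the-envelope transfer-time calculation, using the bandwidth figures quoted above, shows where the TTFT penalty comes from (a real system would also be bounded by the CPU-GPU interconnect, which this sketch ignores):

```python
def transfer_ms(gib, gbps):
    """Time to move `gib` GiB at `gbps` GB/s (decimal GB), in milliseconds."""
    return gib * 2**30 / (gbps * 1e9) * 1e3

# Restoring a hypothetical 6 GiB context from CPU memory (~200 GB/s)
# vs. reading it from HBM (8.0 TB/s = 8000 GB/s):
print(f"CPU: {transfer_ms(6, 200):.1f} ms, HBM: {transfer_ms(6, 8000):.2f} ms")
# → CPU: 32.2 ms, HBM: 0.81 ms
```

Tens of milliseconds per restored context is why the penalty shows up in TTFT for offloaded requests but stays small on average when most working sets fit in HBM.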

Alternatives to Offloading

Before enabling KV cache offloading, try these approaches first: they maintain full GPU-speed operation.

1. NVFP4 Quantization (NVIDIA HGX B200 Exclusive)

Halves model weight memory, freeing VRAM for KV cache:

console
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --tensor-parallel-size 1 \
  --trust-remote-code

2. FP8 KV Cache

Halves per-token KV cache size independently of model quantization:

console
$ vllm serve <model> --kv-cache-dtype fp8

3. Reduce Context Length

console
--max-model-len 16384  # Instead of 32768

4. Increase GPU Memory Utilization

console
--gpu-memory-utilization 0.95  # Instead of 0.90

Raising utilization from 0.90 to 0.95 adds ~9 GB of KV capacity per GPU. Use with caution: the remaining headroom must absorb activation spikes under peak load.
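The ~9 GB figure follows directly from the per-GPU HBM capacity quoted earlier in this guide:

```python
HBM_GB = 179  # per-GPU HBM on the NVIDIA HGX B200 (from this guide)

def extra_kv_gb(old_util, new_util, hbm_gb=HBM_GB):
    """Additional memory made available to the KV cache per GPU."""
    return (new_util - old_util) * hbm_gb

print(f"{extra_kv_gb(0.90, 0.95):.2f} GB")  # → 8.95 GB, i.e. the ~9 GB above
```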

5. Use More GPUs (Increase TP)

Higher tensor parallelism distributes model weights across more GPUs, freeing VRAM per GPU for KV cache:

console
--tensor-parallel-size 4  # Instead of 2
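To first order, weight memory per GPU scales as total weights divided by the TP degree (ignoring replicated layers and activation overhead). The 120 GB weight figure below is a hypothetical example, not a specific model:

```python
def weights_per_gpu_gb(total_weight_gb, tp):
    """Approximate model weight memory per GPU under tensor parallelism."""
    return total_weight_gb / tp

# Hypothetical 120 GB of weights: TP=2 leaves 60 GB/GPU, TP=4 leaves 30 GB/GPU,
# so doubling TP frees ~30 GB per GPU for KV cache.
freed = weights_per_gpu_gb(120, 2) - weights_per_gpu_gb(120, 4)
print(f"{freed:.0f} GB freed per GPU")  # → 30 GB freed per GPU
```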

Dynamo Tiered Caching

NVIDIA Dynamo extends KV cache offloading with a multi-tier hierarchy:

  1. GPU HBM: Primary (179 GB per NVIDIA HGX B200, 8.0 TB/s)
  2. CPU DRAM: Host memory (1-2 TB, ~200 GB/s)
  3. Local NVMe SSD: Persistent cache (~7 GB/s)
  4. Remote storage: NFS, S3-compatible

Dynamo's KV-aware router intelligently places requests on GPUs that already have relevant cache, minimizing transfers.
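A tiered lookup conceptually walks the hierarchy from fastest to slowest tier. The sketch below mirrors the four tiers listed above; the lookup logic is illustrative only and is not Dynamo's actual implementation:

```python
# Tiers ordered fastest-first, with nominal bandwidths in GB/s from the list above
# (remote storage bandwidth is a placeholder assumption).
TIERS = [("GPU HBM", 8000), ("CPU DRAM", 200), ("Local NVMe", 7), ("Remote", 1)]

def find_block(block_id, tier_contents):
    """Return (tier_name, bandwidth_gbps) of the fastest tier holding the block."""
    for name, gbps in TIERS:
        if block_id in tier_contents.get(name, set()):
            return name, gbps
    return None  # cache miss: recompute the KV from scratch

hit = find_block("blk-42", {"CPU DRAM": {"blk-42"}})
print(hit)  # → ('CPU DRAM', 200)
```

The point of KV-aware routing is to make the first probe hit as often as possible by steering a request to a GPU whose HBM tier already holds its prefix blocks.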

See Dynamo Overview for details. Dynamo 0.9.1 was successfully started on this node with vLLM 0.16.0, though tiered caching requires the full Dynamo stack (etcd, NATS) which was not deployed.
