KV Cache Offloading

Updated on 17 March, 2026

Extend effective memory by offloading KV cache to CPU memory.


Overview

KV cache offloading stores key-value cache data in CPU memory when GPU HBM is exhausted, enabling:

  • Longer context lengths
  • More concurrent requests
  • Better memory utilization

Note
With 192-256 GB of HBM per GPU, KV cache offloading is often unnecessary. Consider it only for extreme workloads.

Basic Usage

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager

Configuration Options

| Flag | Purpose | Example |
|------|---------|---------|
| --kv-offloading-backend | Offloading method | native or lmcache |
| --kv-offloading-size | Offload buffer size in GiB | 64 |
| --disable-hybrid-kv-cache-manager | Disable the hybrid KV cache manager | Required for some models |
| --cpu-offload-gb | Offload model weights to CPU, in GiB | 10 |

Backend Options

  • native: vLLM's built-in offloading (recommended)
  • lmcache: LMCache integration for advanced caching

Model Compatibility

Warning
KV cache offloading only works with standard attention architectures (GQA, MHA). MLA models are not supported.
| Model | Attention | KV Offloading |
|-------|-----------|---------------|
| DeepSeek V3.2 (685B) | MLA | Not supported |
| Kimi-K2.5 (1T) | MLA | Not supported |
| Llama 3.1 (405B) | GQA | Supported |
| Qwen3-VL (235B) | GQA | Supported |
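Whether a model uses MLA can usually be read from its config.json. The sketch below is a heuristic, not an official API: the field names are assumed from published DeepSeek-style and Llama-style configs, where MLA models expose a low-rank KV projection field (kv_lora_rank) that GQA/MHA models lack.

```python
def uses_mla(config: dict) -> bool:
    """Heuristic: DeepSeek-style MLA configs carry a low-rank KV
    projection field (kv_lora_rank) that GQA/MHA configs do not."""
    return "kv_lora_rank" in config

# GQA example (Llama-3.1-style config fields)
llama_cfg = {"num_attention_heads": 128, "num_key_value_heads": 8}
# MLA example (DeepSeek-style config fields)
deepseek_cfg = {"num_attention_heads": 128, "kv_lora_rank": 512}

print(uses_mla(llama_cfg))     # False -> KV offloading applicable
print(uses_mla(deepseek_cfg))  # True  -> KV offloading not supported
```

If the check returns True, skip the --kv-offloading-* flags entirely; the server will fail at startup otherwise (see the KeyError below).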

DeepSeek V3.2 (685B) Error

KeyError: 'model.layers.0.self_attn.indexer.k_cache'

MLA (Multi-head Latent Attention) uses an indexer-based KV cache structure that the OffloadingConnector cannot handle.

Working Configurations

Llama-3.1-405B

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager

Qwen3-VL-235B

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager

Sizing Guidelines

Offload Buffer Size

| Context Length | Recommended --kv-offloading-size |
|----------------|----------------------------------|
| 32K | 32-64 GiB |
| 64K | 64-96 GiB |
| 128K | 96-128 GiB |
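These buffer sizes can be sanity-checked with the standard KV cache formula: 2 tensors (key and value) × layers × KV heads × head dim × bytes per element. A sketch using Llama-3.1-405B-like shape parameters (126 layers, 8 KV heads, head dim 128; values assumed from the published config, not taken from this page):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    # Factor of 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3.1-405B-like shape (assumed): 126 layers, 8 KV heads, head dim 128
per_token_fp16 = kv_bytes_per_token(126, 8, 128, dtype_bytes=2)
per_token_fp8 = kv_bytes_per_token(126, 8, 128, dtype_bytes=1)

context = 32 * 1024
GIB = 1024 ** 3
print(f"fp16: {per_token_fp16} B/token, "
      f"{per_token_fp16 * context / GIB:.1f} GiB per 32K sequence")
print(f"fp8:  {per_token_fp8} B/token, "
      f"{per_token_fp8 * context / GIB:.1f} GiB per 32K sequence")
```

Multiplying the per-sequence figure by the expected number of concurrent long-context sequences gives a rough lower bound for the offload buffer.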

When to Use Offloading

  1. 128K+ context with large batch sizes
  2. 1000+ concurrent sequences
  3. Large models (200B+) with long context
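The checklist above can be sketched as a simple predicate. The thresholds for "large batch" and "long context" in rules 1 and 3 are assumptions for illustration, and the function name is hypothetical:

```python
def should_consider_offloading(max_model_len: int,
                               concurrent_seqs: int,
                               model_params_b: int) -> bool:
    """True when any rule-of-thumb trigger from the checklist applies."""
    # Rule 1: 128K+ context with more than a trivial batch (assumed > 1)
    long_context_big_batch = max_model_len >= 128 * 1024 and concurrent_seqs > 1
    # Rule 2: 1000+ concurrent sequences
    many_sequences = concurrent_seqs >= 1000
    # Rule 3: 200B+ model with long context (assumed >= 32K)
    big_model_long_context = model_params_b >= 200 and max_model_len >= 32 * 1024
    return long_context_big_batch or many_sequences or big_model_long_context

print(should_consider_offloading(32 * 1024, 8, 405))  # True: 405B at 32K
print(should_consider_offloading(16 * 1024, 8, 70))   # False: fits in HBM
```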

Alternatives to Offloading

Before enabling KV offloading, consider:

1. FP8 KV Cache

Reduces KV cache memory by 50%:

bash
--kv-cache-dtype fp8

2. Reduce Context Length

bash
--max-model-len 16384

3. Increase GPU Memory Utilization

bash
--gpu-memory-utilization 0.95

Performance Impact

KV offloading adds latency due to CPU-GPU transfers:

| Metric | Without Offloading | With Offloading |
|--------|--------------------|-----------------|
| TTFT | Baseline | Increased |
| Throughput | Baseline | Reduced |
| Max Context | Limited by HBM | Extended |

Exact impact varies by model and workload. The trade-off is worthwhile when you need context lengths or concurrency that wouldn't otherwise fit.

Troubleshooting

GEMM Kernel Error (Llama-405B, MoE models)

RuntimeError: wrong! device_gemm with the specified compilation parameters does not support this GEMM problem

This occurs with large models when AITER GEMM kernels encounter incompatible configurations. Adding KV offloading flags resolves it, even if you don't need the extra memory:

bash
--kv-offloading-backend native \
--kv-offloading-size 64 \
--disable-hybrid-kv-cache-manager

Affected models: Llama-3.1-405B, Qwen3-VL-235B, and other large MoE models.

OOM Despite Offloading

Increase the offload buffer:

bash
--kv-offloading-size 128

Or reduce context length:

bash
--max-model-len 16384
