KV Cache Offloading

Updated on 17 March, 2026

Extend effective memory by offloading KV cache to CPU memory.


Overview

KV cache offloading stores key-value cache data in CPU memory when GPU HBM is exhausted, enabling:

  • Longer context lengths
  • More concurrent requests
  • Better memory utilization

Note
With 192-256 GB of HBM per GPU, KV cache offloading is often unnecessary. Consider it only for extreme workloads.

Basic Usage

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager

Configuration Options

| Flag | Purpose | Example |
|------|---------|---------|
| --kv-offloading-backend | Offloading method | native or lmcache |
| --kv-offloading-size | Offload buffer size in GiB | 64 |
| --disable-hybrid-kv-cache-manager | Disable the hybrid KV cache manager | Required for some models |
| --cpu-offload-gb | Offload model weights to CPU, in GiB | 10 |

Backend Options

  • native: vLLM's built-in offloading (recommended)
  • lmcache: LMCache integration for advanced caching

Model Compatibility

Warning
KV cache offloading only works with standard attention architectures (GQA, MHA). MLA models are not supported.
| Model | Attention | KV Offloading |
|-------|-----------|---------------|
| DeepSeek V3.2 (685B) | MLA | Not supported |
| Kimi-K2.5 (1T) | MLA | Not supported |
| Llama 3.1 (405B) | GQA | Supported |
| Qwen3-VL (235B) | GQA | Supported |
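Whether a model uses MLA can usually be read from its config.json. The sketch below is a heuristic, not an official API: the field names are assumed from published DeepSeek-style and Llama-style configs, where MLA models expose a low-rank KV projection field (kv_lora_rank) that GQA/MHA models lack.

```python
def uses_mla(config: dict) -> bool:
    """Heuristic: DeepSeek-style MLA configs carry a low-rank KV
    projection field (kv_lora_rank) that GQA/MHA configs do not."""
    return "kv_lora_rank" in config

# GQA example (Llama-3.1-style config fields)
llama_cfg = {"num_attention_heads": 128, "num_key_value_heads": 8}
# MLA example (DeepSeek-style config fields)
deepseek_cfg = {"num_attention_heads": 128, "kv_lora_rank": 512}

print(uses_mla(llama_cfg))     # False -> KV offloading applicable
print(uses_mla(deepseek_cfg))  # True  -> KV offloading not supported
```

If the check returns True, skip the --kv-offloading-* flags entirely; the server will fail at startup otherwise (see the KeyError below).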

DeepSeek V3.2 (685B) Error

KeyError: 'model.layers.0.self_attn.indexer.k_cache'

MLA (Multi-head Latent Attention) uses an indexer-based KV cache structure that the OffloadingConnector cannot handle.

Working Configurations

Llama-3.1-405B

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager

Qwen3-VL-235B

bash
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager

Sizing Guidelines

Offload Buffer Size

| Context Length | Recommended --kv-offloading-size |
|----------------|----------------------------------|
| 32K | 32-64 GiB |
| 64K | 64-96 GiB |
| 128K | 96-128 GiB |
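These buffer sizes can be sanity-checked with the standard KV cache formula: 2 tensors (key and value) × layers × KV heads × head dim × bytes per element. A sketch using Llama-3.1-405B-like shape parameters (126 layers, 8 KV heads, head dim 128; values assumed from the published config, not taken from this page):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    # Factor of 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3.1-405B-like shape (assumed): 126 layers, 8 KV heads, head dim 128
per_token_fp16 = kv_bytes_per_token(126, 8, 128, dtype_bytes=2)
per_token_fp8 = kv_bytes_per_token(126, 8, 128, dtype_bytes=1)

context = 32 * 1024
GIB = 1024 ** 3
print(f"fp16: {per_token_fp16} B/token, "
      f"{per_token_fp16 * context / GIB:.1f} GiB per 32K sequence")
print(f"fp8:  {per_token_fp8} B/token, "
      f"{per_token_fp8 * context / GIB:.1f} GiB per 32K sequence")
```

Multiplying the per-sequence figure by the expected number of concurrent long-context sequences gives a rough lower bound for the offload buffer.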

When to Use Offloading

  1. 128K+ context with large batch sizes
  2. 1000+ concurrent sequences
  3. Large models (200B+) with long context
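The checklist above can be sketched as a simple predicate. The thresholds for "large batch" and "long context" in rules 1 and 3 are assumptions for illustration, and the function name is hypothetical:

```python
def should_consider_offloading(max_model_len: int,
                               concurrent_seqs: int,
                               model_params_b: int) -> bool:
    """True when any rule-of-thumb trigger from the checklist applies."""
    # Rule 1: 128K+ context with more than a trivial batch (assumed > 1)
    long_context_big_batch = max_model_len >= 128 * 1024 and concurrent_seqs > 1
    # Rule 2: 1000+ concurrent sequences
    many_sequences = concurrent_seqs >= 1000
    # Rule 3: 200B+ model with long context (assumed >= 32K)
    big_model_long_context = model_params_b >= 200 and max_model_len >= 32 * 1024
    return long_context_big_batch or many_sequences or big_model_long_context

print(should_consider_offloading(32 * 1024, 8, 405))  # True: 405B at 32K
print(should_consider_offloading(16 * 1024, 8, 70))   # False: fits in HBM
```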

Alternatives to Offloading

Before enabling KV offloading, consider:

1. FP8 KV Cache

Reduces KV cache memory by 50%:

bash
--kv-cache-dtype fp8

2. Reduce Context Length

bash
--max-model-len 16384

3. Increase GPU Memory Utilization

bash
--gpu-memory-utilization 0.95

Performance Impact

KV offloading adds latency due to CPU-GPU transfers:

| Metric | Without Offloading | With Offloading |
|--------|--------------------|-----------------|
| TTFT | Baseline | Increased |
| Throughput | Baseline | Reduced |
| Max Context | Limited by HBM | Extended |

Exact impact varies by model and workload. The trade-off is worthwhile when you need context lengths or concurrency that wouldn't otherwise fit.

Troubleshooting

GEMM Kernel Error (Llama-405B, MoE models)

RuntimeError: wrong! device_gemm with the specified compilation parameters does not support this GEMM problem

This occurs with large models when AITER GEMM kernels encounter incompatible configurations. Adding KV offloading flags resolves it, even if you don't need the extra memory:

bash
--kv-offloading-backend native \
--kv-offloading-size 64 \
--disable-hybrid-kv-cache-manager

Affected models: Llama-3.1-405B, Qwen3-VL-235B, and other large MoE models.

OOM Despite Offloading

Increase the offload buffer:

bash
--kv-offloading-size 128

Or reduce context length:

bash
--max-model-len 16384
