Extend effective memory by offloading KV cache to CPU memory.
KV cache offloading stores key-value cache data in CPU memory when GPU HBM is exhausted, enabling longer context lengths and higher request concurrency than HBM alone can hold:
```shell
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager
```
| Flag | Purpose | Example |
|---|---|---|
| `--kv-offloading-backend` | Offloading method | `native` or `lmcache` |
| `--kv-offloading-size` | Buffer size in GiB | `64` |
| `--disable-hybrid-kv-cache-manager` | Disable hybrid management | Required for some models |
| `--cpu-offload-gb` | Model weight offload | `10` |
- `native`: vLLM's built-in offloading (recommended)
- `lmcache`: LMCache integration for advanced caching

| Model | Attention | KV Offloading |
|---|---|---|
| DeepSeek V3.2 (685B) | MLA | Not Supported |
| Kimi-K2.5 (1T) | MLA | Not Supported |
| Llama 3.1 (405B) | GQA | Supported |
| Qwen3-VL (235B) | GQA | Supported |
```text
KeyError: 'model.layers.0.self_attn.indexer.k_cache'
```

MLA (Multi-head Latent Attention) uses an indexer-based KV cache structure that the OffloadingConnector cannot handle.
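You can spot MLA-style models before launching by inspecting the model's `config.json`. The sketch below is a heuristic, not an official check: DeepSeek-style MLA configs publish fields such as `kv_lora_rank`, while GQA models like Llama expose `num_key_value_heads` instead. The example config fragments are trimmed, illustrative values.

```python
# Heuristic check for MLA-style attention before enabling KV offloading.
# "kv_lora_rank" appears in DeepSeek-style MLA config.json files; plain
# GQA configs (e.g. Llama) carry num_key_value_heads instead.

def uses_mla(config: dict) -> bool:
    """Return True if the model config looks like MLA attention."""
    return "kv_lora_rank" in config

# Example config fragments (trimmed, illustrative values):
deepseek_like = {"kv_lora_rank": 512, "qk_rope_head_dim": 64}
llama_like = {"num_attention_heads": 128, "num_key_value_heads": 8}

print(uses_mla(deepseek_like))  # True  -> skip the --kv-offloading-* flags
print(uses_mla(llama_like))     # False -> KV offloading is an option
```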
```shell
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```
```shell
docker run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --ipc=host \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --disable-hybrid-kv-cache-manager
```
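Once the server is up it speaks the standard OpenAI-compatible API on port 8000. A minimal sketch of a smoke-test payload you could POST to `http://localhost:8000/v1/chat/completions` (e.g. with `curl -d @payload.json`); the prompt and `max_tokens` value are illustrative, nothing here contacts the server:

```python
import json

# Build an OpenAI-compatible chat completion request for the server
# launched above. The model name must match the --model flag exactly.
payload = {
    "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
    "messages": [
        {"role": "user", "content": "Summarize KV cache offloading in one sentence."}
    ],
    "max_tokens": 64,
}
print(json.dumps(payload, indent=2))
```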
| Context Length | Recommended `--kv-offloading-size` |
|---|---|
| 32K | 32-64 GiB |
| 64K | 64-96 GiB |
| 128K | 96-128 GiB |
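The recommendations above can be sanity-checked with back-of-the-envelope arithmetic: per-token KV cache size is `2 (K and V) × layers × kv_heads × head_dim × dtype_bytes`. The sketch below plugs in Llama 3.1 405B's architecture (126 layers, 8 KV heads, head dim 128); substitute your own model's numbers.

```python
# Rough sizing for --kv-offloading-size. Architecture numbers are for
# Llama 3.1 405B (126 layers, 8 KV heads, head dim 128).

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # K and V tensors per layer, per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes

fp16 = kv_bytes_per_token(126, 8, 128, 2)   # ~0.49 MiB per token
fp8 = kv_bytes_per_token(126, 8, 128, 1)    # half of fp16

buffer_gib = 64
tokens_fp16 = buffer_gib * 2**30 // fp16
tokens_fp8 = buffer_gib * 2**30 // fp8
print(f"A {buffer_gib} GiB buffer holds ~{tokens_fp16:,} fp16 tokens "
      f"or ~{tokens_fp8:,} fp8 tokens")
```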
Before enabling KV offloading, consider these lighter-weight alternatives first:

- `--kv-cache-dtype fp8`: reduces KV cache memory by 50%
- `--max-model-len 16384`: lowers the context limit so the KV cache fits in HBM
- `--gpu-memory-utilization 0.95`: lets vLLM claim more of the available HBM
KV offloading adds latency due to CPU-GPU transfers:
| Metric | Without Offloading | With Offloading |
|---|---|---|
| TTFT | Baseline | Increased |
| Throughput | Baseline | Reduced |
| Max Context | Limited by HBM | Extended |
Exact impact varies by model and workload. The trade-off is worthwhile when you need context lengths or concurrency that wouldn't otherwise fit.
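A toy model of where the extra TTFT comes from: fetching an offloaded prefix back to HBM costs roughly `prefix_bytes / host-to-device bandwidth`. Both numbers below are assumptions, not measurements: 32 GB/s approximates PCIe Gen4 x16, and the per-token size uses Llama 3.1 405B in fp16.

```python
# Illustrative estimate of TTFT overhead from restoring an offloaded KV
# prefix over the CPU-GPU link. Bandwidth and per-token size are assumed.
KV_BYTES_PER_TOKEN = 2 * 126 * 8 * 128 * 2   # K+V * layers * kv_heads * head_dim * fp16
PCIE_BYTES_PER_S = 32e9                       # assumed host->GPU bandwidth

def added_ttft_s(prefix_tokens: int) -> float:
    """Seconds spent copying the cached prefix back into HBM."""
    return prefix_tokens * KV_BYTES_PER_TOKEN / PCIE_BYTES_PER_S

for n in (8_192, 32_768, 131_072):
    print(f"{n:>7} cached tokens -> ~{added_ttft_s(n):.2f} s extra TTFT")
```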
```text
RuntimeError: wrong! device_gemm with the specified compilation parameters does not support this GEMM problem
```

This occurs with large models when AITER GEMM kernels encounter incompatible configurations. Adding the KV offloading flags resolves it, even if you don't need the extra memory:

```shell
--kv-offloading-backend native \
--kv-offloading-size 64 \
--disable-hybrid-kv-cache-manager
```

Affected models: Llama-3.1-405B, Qwen3-VL-235B, and other large MoE models.
Increase the offload buffer:

```shell
--kv-offloading-size 128
```

Or reduce the context length:

```shell
--max-model-len 16384
```