KV Cache Optimization

Updated on 11 March, 2026

The KV cache is often the dominant consumer of GPU VRAM during inference. With 179 GB per NVIDIA HGX B200 GPU, efficient cache management determines how many concurrent requests you can serve and at what context length.


How KV Cache Scales

For each request, the model stores key-value tensors for every token in the context. The cache grows linearly with:

  • Context length: longer conversations or documents consume more cache
  • Concurrent requests: each active request has its own cache allocation
  • Number of KV heads: model architecture dependent

KV cache per request = 2 × num_layers × num_kv_heads × head_dim × context_length × dtype_size
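The formula above can be sketched as a quick calculator. The model dimensions below are illustrative, not taken from any model in this cookbook:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length, dtype_size=2):
    """Per-request KV cache: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * context_length * dtype_size

# Illustrative dense model: 48 layers, 8 KV heads (GQA), head_dim 128, BF16 (2 bytes/element)
per_request = kv_cache_bytes(48, 8, 128, context_length=32_768)
print(f"{per_request / 2**30:.1f} GiB per request at 32K context")  # 6.0 GiB
```

Doubling the context doubles the per-request cache, which is why long-context serving is usually KV-bound rather than compute-bound.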

Architecture-Specific KV Behavior

The five models in this cookbook use fundamentally different attention mechanisms, which directly affects their KV cache requirements:

Nemotron Nano 30B: Minimal KV Cache (Mamba Hybrid)

Mamba (SSM) layers have no KV cache: they compress sequence history into a fixed-size state. Only the interleaved Transformer layers maintain standard KV cache. This means:

  • KV cache grows much slower with context length than pure-attention models
  • Single GPU deployment is feasible at long context even in FP8
  • The model can handle many more concurrent requests before running out of VRAM

With TP=2 in FP8, Nemotron Nano uses only ~15 GB per GPU for weights, leaving ~146 GB for KV cache: enough for 1,298 concurrent sequences at 32K context.

MiniMax M2.5 229B: Lightning Attention (Linear + SoftMax Hybrid)

MiniMax M2.5 uses Lightning Attention: linear attention (O(n)) for intra-chunk processing and standard SoftMax attention for inter-chunk processing. The linear attention layers don't require per-token KV storage: they accumulate into a fixed-size state, similar to Mamba.

This hybrid approach means:

  • Lower per-token KV overhead than pure-attention models
  • Better scaling at very long contexts
  • Still requires KV cache for the SoftMax attention layers

GLM-5 744B: Differential Sparse Attention (DSA)

GLM-5 uses Differential Sparse Attention, which selectively attends to important tokens and compresses the rest. This reduces effective KV cache size by discarding low-importance entries during generation.

With TP=8 in FP8, the model weights consume ~89 GB per GPU, leaving ~64 GB per GPU for KV cache.

Nemotron Super 49B: Standard Multi-Head Attention (Dense)

Nemotron Super 49B is a dense transformer: all 49B parameters are active on every token, with standard multi-head attention. This means full KV cache per layer, with no MoE routing or SSM compression.

With TP=1 in FP8, the model weights consume ~49 GB on a single GPU, leaving ~109 GB for KV cache.

DeepSeek V3.2 685B: Multi-Latent Attention (MLA)

MLA compresses KV projections into a lower-dimensional latent space before caching. Instead of storing full key and value tensors per head, MLA stores a single compressed latent vector per token:

  • Standard GQA: cache_per_token = 2 × num_kv_heads × head_dim × dtype_size
  • MLA: cache_per_token = latent_dim × dtype_size (much smaller)

This compression means DeepSeek V3.2 uses significantly less KV cache per token than a standard 685B model would. However, MLA's compressed format means standard KV cache offloading and some optimization features don't apply.
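The per-token comparison above can be put into numbers. The dimensions here are illustrative, not DeepSeek V3.2's actual configuration:

```python
def gqa_cache_per_token(num_kv_heads, head_dim, dtype_size=2):
    # Standard GQA: full K and V tensors stored for every KV head, per layer
    return 2 * num_kv_heads * head_dim * dtype_size

def mla_cache_per_token(latent_dim, dtype_size=2):
    # MLA: a single compressed latent vector per token, per layer
    return latent_dim * dtype_size

# Illustrative per-layer sizes (assumed dimensions, BF16)
gqa = gqa_cache_per_token(num_kv_heads=8, head_dim=128)  # 4096 bytes
mla = mla_cache_per_token(latent_dim=512)                # 1024 bytes
print(f"GQA: {gqa} B/token/layer, MLA: {mla} B/token/layer, {gqa // mla}x smaller")
```

With these assumed dimensions the latent cache is 4x smaller per token per layer; the real ratio depends on the model's head count and latent width.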

FP8 KV Cache

Independently of model weight quantization, you can quantize the KV cache to FP8, halving per-request memory:

console
$ vllm serve <model> --kv-cache-dtype fp8

Compatibility:

| Model              | KV Cache FP8       | Notes                                              |
|--------------------|--------------------|----------------------------------------------------|
| Nemotron Nano 30B  | Supported          | Minimal impact since Mamba layers have no KV cache |
| Nemotron Super 49B | Supported          | Standard MHA, full benefit from FP8 KV             |
| MiniMax M2.5       | Check vLLM support | Lightning Attention layers may not benefit         |
| GLM-5 744B         | Supported          | Reduces cache for standard attention layers        |
| DeepSeek V3.2      | Limited benefit    | MLA already compresses KV projections              |

Note: FP8 KV cache is independent of model weight quantization. You can use an FP8-quantized model with BF16 KV cache, or a BF16 model with FP8 KV cache.

VRAM Budget Breakdown

For each model on the NVIDIA HGX B200 (179 GB per GPU, --gpu-memory-utilization 0.90):

Usable VRAM per GPU = 179 GB × 0.90 = ~161 GB
Model weights per GPU = total_model_size / tensor_parallel_size
KV cache budget = Usable VRAM - Model weights - CUDA overhead (~2-5 GB)

| Model              | Quant | TP | Weights/GPU | KV Budget/GPU | Relative KV Capacity      |
|--------------------|-------|----|-------------|---------------|---------------------------|
| Nemotron Nano 30B  | FP8   | 2  | ~15 GB      | ~146 GB       | Very high                 |
| Nemotron Super 49B | FP8   | 1  | ~49 GB      | ~109 GB       | High                      |
| MiniMax M2.5 229B  | FP8   | 4  | ~29 GB      | ~132 GB       | High                      |
| GLM-5 744B         | FP8   | 8  | ~89 GB      | ~64 GB        | Moderate                  |
| DeepSeek V3.2 685B | FP8   | 8  | ~86 GB      | ~75 GB        | Moderate (MLA compressed) |
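The budget arithmetic above can be sketched as a helper. The ~3 GB overhead constant is an assumed midpoint of the ~2-5 GB CUDA overhead range:

```python
def kv_budget_per_gpu(total_weights_gb, tp, vram_gb=179, util=0.90, overhead_gb=3):
    """KV cache budget per GPU, following the breakdown above."""
    usable = vram_gb * util                    # what vLLM is allowed to allocate
    weights_per_gpu = total_weights_gb / tp    # weights are sharded across TP ranks
    return usable - weights_per_gpu - overhead_gb

# Nemotron Super 49B in FP8 (~49 GB of weights) at TP=1
print(f"{kv_budget_per_gpu(49, 1):.1f} GB per GPU for KV cache")  # ~109 GB
```

This matches the ~109 GB row in the table; plugging in other weight sizes and TP degrees reproduces the rest of the column to within the overhead estimate.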

Increasing KV Cache Capacity

1. Use NVFP4 Quantization (NVIDIA HGX B200 Only)

NVFP4 halves model weight memory compared to FP8. The biggest benefit isn't more KV cache per se: it's the ability to reduce tensor parallelism, freeing entire GPUs:

console
# Nemotron Nano: FP8 needs TP=2, NVFP4 fits on TP=1
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

2. Increase GPU Memory Utilization

The default 0.90 reserves 10% as a safety margin. For dedicated inference servers with no other GPU workloads:

console
--gpu-memory-utilization 0.95

This adds ~9 GB of KV cache capacity per GPU. Use with caution: setting it too high can cause OOM under peak load.

3. Reduce Context Length

If your workload doesn't need the full context window:

console
--max-model-len 8192   # Instead of 32768

This doesn't directly save VRAM (vLLM pre-allocates based on capacity, not length), but it allows more concurrent sequences since each sequence's maximum possible cache is smaller.
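A rough upper bound on concurrently resident sequences shows why a smaller --max-model-len helps. Both the per-token cache size and the budget below are assumed, illustrative values:

```python
def max_concurrent_seqs(kv_budget_gb, cache_per_token_bytes, max_model_len):
    """Upper bound on sequences resident at once, each reserved at full max_model_len."""
    budget_bytes = kv_budget_gb * 10**9
    per_seq_bytes = cache_per_token_bytes * max_model_len
    return int(budget_bytes // per_seq_bytes)

# Illustrative: 109 GB KV budget, 100 KB of KV per token (model-dependent)
for mml in (32_768, 8_192):
    print(f"max-model-len {mml}: up to {max_concurrent_seqs(109, 100_000, mml)} sequences")
```

In practice vLLM's paged allocator only reserves blocks as tokens arrive, so real concurrency is usually higher than this worst-case bound; the 4x ratio between the two context lengths is the point.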

4. FP8 KV Cache

For models where it's supported:

console
--kv-cache-dtype fp8

Halves the per-token cache size at the cost of slight precision loss in attention scores.

Dynamo Tiered KV Caching

NVIDIA Dynamo extends KV cache beyond GPU VRAM with a multi-tier hierarchy:

  1. GPU HBM: Primary (179 GB per NVIDIA HGX B200)
  2. CPU DRAM: Host memory offload
  3. Local NVMe SSD: Persistent cache across requests
  4. Remote storage: NFS, S3-compatible

Tiered caching extends effective capacity beyond the NVIDIA HGX B200's 179 GB GPU VRAM: hot cache in HBM processes at full 8.0 TB/s bandwidth, while cold data spills to CPU/disk.

See Dynamo Overview for setup details.

Prefix Caching

vLLM supports automatic prefix caching, which reuses KV cache blocks across requests that share the same prompt prefix (e.g., system prompts, few-shot examples):

console
# Enabled by default in vLLM V1
# Verify in logs: "Prefix cache hit rate: XX%"

Our benchmarks show high prefix cache hit rates (>99% at steady state with a random dataset) because vLLM's block allocator aggressively reuses cached blocks. In production with repeated system prompts, this significantly reduces TTFT from the second request onward.
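The mechanism can be illustrated with a toy sketch of block-level hashing. This is a conceptual model, not vLLM's actual implementation; only the block size of 16 tokens matches vLLM's default:

```python
from hashlib import sha256

BLOCK = 16  # tokens per KV block (vLLM's default block size)

def block_hashes(token_ids):
    """Hash each full block together with everything before it, mimicking how a
    prefix cache keys KV blocks: identical prompt prefixes yield identical hashes."""
    hashes, prefix = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        prefix += str(token_ids[i:i + BLOCK]).encode("utf-8")
        hashes.append(sha256(prefix).hexdigest())
    return hashes

system = list(range(64))                      # shared system-prompt tokens (4 blocks)
a = block_hashes(system + [1, 2] * 16)        # request A: system prompt + its own turn
b = block_hashes(system + [3, 4] * 16)        # request B: same prefix, different turn
shared = sum(x == y for x, y in zip(a, b))
print(f"{shared} of {len(a)} KV blocks reusable across the two requests")  # 4 of 6
```

Because each hash covers the full prefix, a block is only reusable when everything before it also matches, which is exactly why shared system prompts cache so well.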

Monitoring KV Cache Usage

During serving, vLLM logs KV cache utilization:

Engine 000: Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.3%

Key indicators:

  • KV cache usage > 90%: Approaching capacity, TTFT will increase as requests queue
  • Waiting > 0: Requests are queued because KV cache is full
  • Prefix cache hit rate: Higher is better; indicates effective cache reuse
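For dashboards or alerts, the status line can be parsed with a small regex. The exact log format may vary across vLLM versions, so treat the pattern below as an assumption to verify against your logs:

```python
import re

LOG_RE = re.compile(
    r"Running: (\d+) reqs, Waiting: (\d+) reqs, "
    r"GPU KV cache usage: ([\d.]+)%"
)

def parse_engine_log(line):
    """Extract (running, waiting, kv_usage_pct) from a vLLM engine status line."""
    m = LOG_RE.search(line)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))

line = "Engine 000: Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.3%"
running, waiting, usage = parse_engine_log(line)
if usage > 90 or waiting > 0:
    print("KV cache pressure: add capacity or throttle admission")
else:
    print(f"healthy: {running} running, {usage:.1f}% KV used")
```

Feeding these three values into a metrics system gives you the two alert conditions listed above (usage > 90%, waiting > 0) with no extra instrumentation.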
