The KV cache is often the dominant consumer of GPU VRAM during inference. With 179 GB per NVIDIA HGX B200 GPU, efficient cache management determines how many concurrent requests you can serve and at what context length.
For each request, the model stores key-value tensors for every token in the context. The cache size grows linearly with context length:
KV cache per request = 2 × num_layers × num_kv_heads × head_dim × context_length × dtype_size

The five models in this cookbook use fundamentally different attention mechanisms, a difference that directly affects their KV cache requirements:
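As a worked sketch of the formula, here is the calculation in code. The layer and head counts below are illustrative assumptions, not any of these models' real configs:

```python
def kv_cache_bytes_per_request(num_layers, num_kv_heads, head_dim,
                               context_length, dtype_size):
    # Factor of 2: one tensor each for keys and values
    return 2 * num_layers * num_kv_heads * head_dim * context_length * dtype_size

# Illustrative config: 48 layers, 8 KV heads (GQA), head_dim 128,
# 32K context, FP8 KV cache (1 byte per element)
per_request = kv_cache_bytes_per_request(48, 8, 128, 32768, 1)
print(f"{per_request / 2**30:.1f} GiB per request")  # 3.0 GiB per request
```

At 3 GiB per 32K-context request, a ~100 GB cache budget supports only a few dozen worst-case sequences, which is why the architectural differences below matter so much.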
Mamba (SSM) layers have no KV cache: they compress sequence history into a fixed-size state. Only the interleaved Transformer layers maintain standard KV cache. This means:
With TP=2 in FP8, Nemotron Nano uses only ~15 GB per GPU for weights, leaving ~146 GB for KV cache: enough for 1,298 concurrent sequences at 32K context.
MiniMax M2.5 uses Lightning Attention: linear attention (O(n)) for intra-chunk processing and standard SoftMax attention for inter-chunk. The linear attention layers don't require per-token KV storage: they accumulate into a fixed-size state, similar to Mamba.
This hybrid approach means only the SoftMax attention layers contribute per-token KV cache, keeping MiniMax M2.5's cache footprint well below that of a comparably sized full-attention model.
GLM-5 uses Differential Sparse Attention which selectively attends to important tokens and compresses the rest. This reduces effective KV cache size by discarding low-importance entries during generation.
With TP=8 in FP8, the model weights consume ~89 GB per GPU, leaving ~64 GB per GPU for KV cache.
Nemotron Super 49B is a dense transformer: all 49B parameters are active on every token, with standard multi-head attention. This means full KV cache per layer, with no MoE routing or SSM compression.
With TP=1 in FP8, the model weights consume ~49 GB on a single GPU, leaving ~109 GB for KV cache.
MLA compresses KV projections into a lower-dimensional latent space before caching. Instead of storing full key and value tensors per head, MLA stores a single compressed latent vector per token:
Standard: cache_per_token = 2 × num_kv_heads × head_dim × dtype_size
MLA:      cache_per_token = latent_dim × dtype_size  (much smaller)

This compression means DeepSeek V3.2 uses significantly less KV cache per token than a standard 685B model would. However, MLA's compressed format means standard KV cache offloading and some optimization features don't apply.
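To make the compression concrete, a rough per-token, per-layer comparison. The dimensions are assumptions chosen for illustration, and real MLA layouts also carry a small RoPE component not shown here:

```python
# Per-token, per-layer KV cache in bytes (FP8, 1 byte per element)
num_kv_heads, head_dim, latent_dim = 128, 128, 512  # illustrative MLA-style dims

standard = 2 * num_kv_heads * head_dim * 1  # full K and V tensors per head
mla = latent_dim * 1                        # one compressed latent vector

print(standard, mla, standard // mla)  # 32768 512 64
```

Under these assumed dimensions the latent cache is 64x smaller per token, which is why a 685B MLA model can run with a "Moderate" rather than prohibitive KV budget.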
Independently from model weight quantization, you can quantize the KV cache to FP8, halving per-request memory:
$ vllm serve <model> --kv-cache-dtype fp8
Compatibility:
| Model | KV Cache FP8 | Notes |
|---|---|---|
| Nemotron Nano 30B | Supported | Minimal impact since Mamba layers have no KV cache |
| Nemotron Super 49B | Supported | Standard MHA, full benefit from FP8 KV |
| MiniMax M2.5 | Check vLLM support | Lightning Attention layers may not benefit |
| GLM-5 744B | Supported | Reduces cache for standard attention layers |
| DeepSeek V3.2 | Limited benefit | MLA already compresses KV projections |
For each model on the NVIDIA HGX B200 (179 GB per GPU, --gpu-memory-utilization 0.90):
Usable VRAM per GPU = 179 GB × 0.90 = ~161 GB
Model weights per GPU = total_model_size / tensor_parallel_size
KV cache budget = Usable VRAM - Model weights - CUDA overhead (~2-5 GB)

| Model | Quant | TP | Weights/GPU | KV Budget/GPU | Relative KV Capacity |
|---|---|---|---|---|---|
| Nemotron Nano 30B | FP8 | 2 | ~15 GB | ~146 GB | Very high |
| Nemotron Super 49B | FP8 | 1 | ~49 GB | ~109 GB | High |
| MiniMax M2.5 229B | FP8 | 4 | ~29 GB | ~132 GB | High |
| GLM-5 744B | FP8 | 8 | ~89 GB | ~64 GB | Moderate |
| DeepSeek V3.2 685B | FP8 | 8 | ~86 GB | ~75 GB | Moderate (MLA compressed) |
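The budget formula above can be reproduced with a small helper. The ~4 GB CUDA overhead is an assumed midpoint of the 2-5 GB range, so results land near (not exactly on) the table values:

```python
def kv_budget_per_gpu(total_gpu_gb=179, util=0.90, weights_per_gpu_gb=0,
                      cuda_overhead_gb=4):
    # Usable VRAM minus weights minus runtime overhead = KV cache budget
    usable = total_gpu_gb * util
    return usable - weights_per_gpu_gb - cuda_overhead_gb

# e.g. Nemotron Super 49B at TP=1 with ~49 GB of FP8 weights per GPU
print(f"{kv_budget_per_gpu(weights_per_gpu_gb=49):.0f} GB")  # 108 GB
```

The 108 GB result is within a GB of the table's ~109 GB; the residual difference is rounding plus the overhead estimate.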
NVFP4 halves model weight memory compared to FP8. The biggest benefit isn't more KV cache per se: it's the ability to reduce tensor parallelism, freeing entire GPUs:
# Nemotron Nano: FP8 needs TP=2, NVFP4 fits on TP=1
$ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--trust-remote-code
The default 0.90 reserves 10% as a safety margin. For dedicated inference servers with no other GPU workloads:
--gpu-memory-utilization 0.95
This adds ~9 GB of KV cache capacity per GPU. Use with caution: too high can cause OOM under peak load.
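The ~9 GB figure is simply the utilization delta applied to total VRAM:

```python
total_vram_gb = 179                       # per NVIDIA HGX B200 GPU
extra = (0.95 - 0.90) * total_vram_gb     # extra VRAM released to the KV cache
print(f"~{extra:.0f} GB")                 # ~9 GB
```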
If your workload doesn't need the full context window:
--max-model-len 8192 # Instead of 32768
This doesn't directly save VRAM (vLLM pre-allocates based on capacity, not length), but it allows more concurrent sequences since each sequence's maximum possible cache is smaller.
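Concretely, the worst-case concurrency bound is the KV budget divided by each sequence's maximum possible cache. The per-token cost below is an assumed figure for illustration, and real schedulers account per block rather than per sequence:

```python
def max_concurrent_seqs(kv_budget_gb, bytes_per_token, max_model_len):
    # Worst case: every resident sequence grows to max_model_len tokens
    per_seq_bytes = bytes_per_token * max_model_len
    return int(kv_budget_gb * 2**30 // per_seq_bytes)

# Same 109 GB budget, assumed 96 KiB/token: a 4x shorter context window
# quadruples the worst-case number of resident sequences
print(max_concurrent_seqs(109, 96 * 1024, 32768))  # 36
print(max_concurrent_seqs(109, 96 * 1024, 8192))   # 145
```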
For models where it's supported:
--kv-cache-dtype fp8
Halves the per-token cache size at the cost of slight precision loss in attention scores.
NVIDIA Dynamo extends KV cache beyond GPU VRAM with a multi-tier hierarchy:
Tiered caching extends effective capacity beyond the NVIDIA HGX B200's 179 GB of GPU VRAM: hot cache stays in HBM at the full 8.0 TB/s bandwidth, while cold data spills to CPU memory or disk.
See Dynamo Overview for setup details.
vLLM supports automatic prefix caching, which reuses KV cache blocks across requests that share the same prompt prefix (e.g., system prompts, few-shot examples):
# Enabled by default in vLLM V1
# Verify in logs: "Prefix cache hit rate: XX%"
Our benchmarks show high prefix cache hit rates (>99% at steady state with random dataset) because vLLM's block allocator aggressively reuses cached blocks. In production with repeated system prompts, this significantly reduces TTFT for the second request onward.
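The mechanism can be sketched as content-addressed block reuse: KV blocks are keyed by a hash of the entire token prefix up to and including that block, so two requests sharing a prompt prefix map to the same cached blocks. This is a toy model of the idea, not vLLM's actual allocator:

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is also 16)

def block_keys(token_ids):
    # A block is reusable only if everything before it also matched,
    # so each key hashes the full prefix ending at that block.
    keys = []
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        keys.append(hash(tuple(token_ids[:i + BLOCK_SIZE])))
    return keys

system_prompt = list(range(100))          # shared 100-token system prompt
req_a = system_prompt + [500, 501, 502]   # request A: same prefix, different tail
req_b = system_prompt + [900, 901]        # request B arrives later

cache = set(block_keys(req_a))            # blocks materialized by request A
hits = [k in cache for k in block_keys(req_b)]
print(sum(hits), "of", len(hits), "full blocks reused")  # 6 of 6
```

Request B recomputes only the partial trailing block, which is why TTFT drops sharply from the second request onward.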
During serving, vLLM logs KV cache utilization:
Engine 000: Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.3%

Key indicators:
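A small helper can scrape these fields from the log stream for dashboards or alerts. The regex matches the log line format shown above; adjust it if your vLLM version formats the line differently:

```python
import re

LOG_RE = re.compile(
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv_usage>[\d.]+)%"
)

def parse_engine_log(line):
    # Returns None for lines that are not engine status lines
    m = LOG_RE.search(line)
    if not m:
        return None
    return {
        "running": int(m["running"]),
        "waiting": int(m["waiting"]),
        "kv_usage_pct": float(m["kv_usage"]),
    }

line = "Engine 000: Running: 128 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.3%"
print(parse_engine_log(line))  # {'running': 128, 'waiting': 0, 'kv_usage_pct': 18.3}
```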