How to Configure Smart Routing in NVIDIA Dynamo

Updated on 12 March, 2026
Guide
Deploy NVIDIA Dynamo KV-aware routing for distributed LLM inference to reduce TTFT, improve throughput, and optimize GPU utilization.

NVIDIA Dynamo is an open-source, high-throughput, low-latency inference framework designed to deploy large-scale generative AI and reasoning models across multi-node, multi-GPU environments. The platform's KV Router intelligently directs inference requests by evaluating computational costs across workers, considering both prefill and decode phases. By maximizing KV cache reuse and minimizing redundant computation, smart routing reduces time-to-first-token (TTFT) by up to 40% and increases overall throughput by 25-30% compared to round-robin load balancing.

This guide outlines the configuration of KV-aware smart routing in NVIDIA Dynamo for distributed inference deployments. It covers infrastructure setup, router enablement for both aggregated and disaggregated serving patterns, parameter tuning for optimal performance, and monitoring techniques for production deployments.

Prerequisites

Before you begin, ensure you:

  • Have deployed a server with at least two NVIDIA GPUs (the disaggregated serving example uses four) and a recent NVIDIA driver.
  • Have installed Docker and the NVIDIA Container Toolkit.
  • Have a Hugging Face account and an access token for model downloads.
  • Have sudo access to the server.

Understanding KV-Aware Routing

Traditional load balancing strategies like round-robin or random routing distribute requests evenly across workers without considering their cached state. KV-aware routing optimizes performance by directing requests to workers that already have relevant KV cache blocks, reducing redundant computation during the prefill phase.

How KV Cache Routing Works

The KV Router makes routing decisions based on a cost function that balances two factors:

1. Prefill Cost

The router calculates prefill cost by determining how many tokens need processing from scratch. When a worker has cached KV blocks matching the request's input prefix, those blocks can be reused, reducing prefill computation.

Prefill blocks = (Total input tokens - Cached tokens) / Block size
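As an illustrative sketch (the real router works on block hashes rather than raw token counts), the prefill-block calculation above can be expressed as a ceiling division, assuming vLLM's default block size of 16 tokens:

```python
import math

def prefill_blocks(total_input_tokens, cached_tokens, block_size=16):
    """Blocks that must be computed from scratch during prefill.
    Illustrative only: mirrors the formula above with ceiling division."""
    return math.ceil((total_input_tokens - cached_tokens) / block_size)

# A 2048-token prompt with a 512-token cached prefix leaves 96 blocks to prefill.
print(prefill_blocks(2048, 512))   # 96
# The same prompt on a cold worker requires the full 128 blocks.
print(prefill_blocks(2048, 0))     # 128
```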

2. Decode Cost

The router tracks active blocks on each worker to estimate decode capacity. Workers with fewer active sequences have more capacity for new requests.

3. Combined Cost Formula

cost = overlap_score_weight × prefill_blocks + decode_blocks

Where:

  • overlap_score_weight balances TTFT optimization (higher values) vs load distribution (lower values).
  • Lower cost indicates a better routing choice.
  • Default overlap_score_weight is 1.0.

Worker Selection Process

The router evaluates all available workers and selects the one with the lowest cost. Consider this example with 3 workers and overlap_score_weight = 1.0:

KV Cache Router Worker Selection Process

Worker   | Cached Blocks | Prefill Blocks | Decode Blocks | Total Cost        | Selected
---------|---------------|----------------|---------------|-------------------|---------
Worker 1 | 2             | 8              | 10            | 18 (1.0 × 8 + 10) | No
Worker 2 | 5             | 5              | 5             | 10 (1.0 × 5 + 5)  | Yes
Worker 3 | 8             | 2              | 9             | 11 (1.0 × 2 + 9)  | No

Worker 2 provides the best balance of cache reuse and available capacity, minimizing both prefill computation and decode load.
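The selection in the table above can be reproduced with a few lines of Python. This is a toy sketch of the cost function, not Dynamo's implementation:

```python
def route_cost(prefill_blocks, decode_blocks, overlap_score_weight=1.0):
    # cost = overlap_score_weight × prefill_blocks + decode_blocks (lower is better)
    return overlap_score_weight * prefill_blocks + decode_blocks

# (prefill_blocks, decode_blocks) for each worker, as in the table above
workers = {
    "Worker 1": (8, 10),
    "Worker 2": (5, 5),
    "Worker 3": (2, 9),
}

costs = {name: route_cost(p, d) for name, (p, d) in workers.items()}
selected = min(costs, key=costs.get)
print(costs)      # {'Worker 1': 18.0, 'Worker 2': 10.0, 'Worker 3': 11.0}
print(selected)   # Worker 2
```

Raising `overlap_score_weight` above 1.0 would pull the decision toward Worker 3 (most cached blocks); lowering it would weight the decode load more heavily.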

Performance Benefits

KV-aware routing provides measurable improvements over basic load balancing:

  • TTFT Reduction: 30-40% lower time-to-first-token by maximizing cache hits.
  • Throughput Increase: 25-30% higher tokens/second by reducing redundant prefill computation.
  • Better GPU Utilization: More balanced load distribution across workers.
  • Cache Efficiency: Higher cache hit rates through intelligent request placement.

These benefits are most significant for workloads with:

  • Long input prompts (>2000 tokens) where prefill cost dominates.
  • Similar or repetitive input prefixes across requests.
  • Multiple inference workers with independent KV caches.

Clone the Dynamo Repository

NVIDIA Dynamo provides deployment scripts, launch utilities, and runtime modules for inference workloads. Clone the repository to access deployment examples and container runtime scripts.

  1. Clone the repository.

    console
    $ git clone https://github.com/ai-dynamo/dynamo.git
    
  2. Navigate to the repository directory.

    console
    $ cd dynamo
    
  3. Switch to the latest stable release.

    console
    $ git checkout release/0.9.0
    

    Visit the Dynamo releases page to find the latest stable version.

Verify CUDA Version and Pull Container Image

The vLLM container requires a CUDA version match between the host driver and container runtime to prevent GPU kernel incompatibilities. The NVIDIA Container Toolkit maps host GPUs into containers, requiring the container's CUDA version to align with the host driver's supported version.

  1. Check the installed CUDA version.

    console
    $ nvidia-smi
    

    The output displays the CUDA version in the top-right corner of the table.

    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
    +-----------------------------------------+------------------------+----------------------+
    ....
  2. Pull the vLLM container image from NGC. Match the CUDA version in the image tag to your system's CUDA version.

    • For CUDA 13.x, use:

      console
      $ docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0-cuda13
      
    • For CUDA 12.x, use:

      console
      $ docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0
      
    Note
    Visit the NVIDIA NGC Catalog to view all available image tags and CUDA versions.
  3. (Optional) Build the container from source instead of pulling the pre-built image.

    console
    $ ./container/build.sh --framework VLLM
    

    The build process creates an image named dynamo:latest-vllm. If you prefer using this locally built image, replace nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0-cuda13 with dynamo:latest-vllm in all subsequent commands.

Configure Hugging Face Cache Permissions

The container runs as UID 1000 and requires write access to the Hugging Face cache directory for model downloads. Incorrect permissions prevent the container from accessing cached model weights, causing worker initialization failures.

  1. Create the cache directory if it does not exist.

    console
    $ mkdir -p container/.cache/huggingface
    
  2. Set ownership to the container user (UID 1000).

    console
    $ sudo chown -R 1000:1000 container/.cache/huggingface
    
  3. Set appropriate permissions.

    console
    $ sudo chmod -R 775 container/.cache/huggingface
    

Start Infrastructure Services

Dynamo's KV Router requires etcd for worker discovery and NATS for KV cache event propagation. The etcd service maintains a registry of active workers and their capabilities, while NATS handles real-time KV block events (creation and eviction) that enable the router to track cached state across workers.

  1. Start the infrastructure services using Docker Compose.

    console
    $ docker compose -f deploy/docker-compose.yml up -d
    
  2. Verify the services are running.

    console
    $ docker compose -f deploy/docker-compose.yml ps
    

    The output displays the running etcd and NATS containers.

    NAME                   IMAGE                      COMMAND                  SERVICE       CREATED          STATUS          PORTS
    deploy-etcd-server-1   bitnamilegacy/etcd:3.6.1   "/opt/bitnami/script…"   etcd-server   18 seconds ago   Up 18 seconds   0.0.0.0:2379-2380->2379-2380/tcp, [::]:2379-2380->2379-2380/tcp
    deploy-nats-server-1   nats:2.11.4                "/nats-server -c /et…"   nats-server   18 seconds ago   Up 18 seconds   0.0.0.0:4222->4222/tcp, [::]:4222->4222/tcp, 0.0.0.0:6222->6222/tcp, [::]:6222->6222/tcp, 0.0.0.0:8222->8222/tcp, [::]:8222->8222/tcp
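As an optional extra check, you can probe both services' HTTP health endpoints directly. This sketch assumes the default ports published by the compose file (etcd on 2379, NATS monitoring on 8222); etcd exposes `/health` and the NATS monitoring server exposes `/healthz`:

```python
import json
import urllib.request

def fetch_json(url, timeout=5.0):
    """GET a health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())

def is_healthy(payload):
    """etcd reports {"health": "true"}; NATS /healthz reports {"status": "ok"}."""
    return payload.get("health") == "true" or payload.get("status") == "ok"

if __name__ == "__main__":
    for url in ("http://localhost:2379/health", "http://localhost:8222/healthz"):
        try:
            status = "OK" if is_healthy(fetch_json(url)) else "UNHEALTHY"
        except (OSError, ValueError) as exc:
            status = f"unreachable ({exc})"
        print(f"{url}: {status}")
```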

KV Router Configuration Parameters

The KV Router uses environment variables to control routing behavior, cache tracking, and worker selection logic. These parameters determine how the router balances Time To First Token (TTFT) optimization through cache reuse against Inter-Token Latency (ITL) optimization through load distribution. All parameters can be set via environment variables or command-line arguments on the Dynamo frontend.

Core Router Parameters

Variable                    | Description                                          | Default          | Recommended | Notes
----------------------------|------------------------------------------------------|------------------|-------------|------
DYN_ROUTER_MODE             | Enables KV cache-aware routing strategy              | round_robin      | kv          | Set to kv to enable cache-aware routing. Other modes: round_robin, random
DYN_ROUTER_TEMPERATURE      | Controls routing randomness via softmax sampling     | 0.0              | 0.1         | 0.0 = deterministic (always lowest cost). Higher values add randomness to prevent saturation
DYN_KV_OVERLAP_SCORE_WEIGHT | Weight for prefix cache overlaps in cost calculation | 1.0              | 1.5         | Formula: cost = weight × prefill_blocks + decode_blocks. Higher values prioritize TTFT over ITL
DYN_KV_CACHE_BLOCK_SIZE     | KV cache block size in tokens (must match backend)   | Backend-specific | 16          | Must match vLLM's block size. Used for prefill block calculation
DYN_KV_EVENTS               | Enables real-time KV event tracking from workers     | true             | true        | When disabled, router predicts cache state with TTL-based expiration

Worker Capacity Thresholds

These parameters define when a worker is considered "busy" and should be excluded from routing decisions.

Variable                            | Description                                        | Default | Recommended | Notes
------------------------------------|----------------------------------------------------|---------|-------------|------
DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD  | Busy threshold based on KV cache block utilization | Not set | 0.85        | Value between 0.0-1.0. Workers exceeding this percentage are excluded. Can update via /busy_threshold API
DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD | Busy threshold based on prefill token count        | Not set | 10000       | Absolute token count. Workers exceeding this are excluded. Uses OR logic with the decode-blocks threshold
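The OR logic between the two thresholds can be sketched as follows. This is an illustrative model of the exclusion behavior, not Dynamo's implementation; the worker names and utilization numbers are hypothetical:

```python
def is_busy(decode_block_util, prefill_tokens,
            decode_blocks_threshold=0.85, prefill_tokens_threshold=10_000):
    """A worker is excluded when EITHER threshold is exceeded (OR logic),
    mirroring DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD and
    DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD."""
    return (decode_block_util > decode_blocks_threshold
            or prefill_tokens > prefill_tokens_threshold)

# (KV cache utilization fraction, queued prefill tokens) per worker
workers = {
    "worker-a": (0.40, 2_000),    # healthy
    "worker-b": (0.90, 1_000),    # KV cache nearly full -> excluded
    "worker-c": (0.50, 12_000),   # prefill backlog too deep -> excluded
}
eligible = [name for name, (util, toks) in workers.items()
            if not is_busy(util, toks)]
print(eligible)   # ['worker-a']
```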

Infrastructure Configuration

Variable        | Description                               | Default          | Notes
----------------|-------------------------------------------|------------------|------
ETCD_ENDPOINTS  | etcd server URL for worker discovery      | N/A              | Format: http://localhost:2379. Separate multiple endpoints with commas
NATS_SERVER     | NATS server URL for KV event propagation  | N/A              | Format: nats://localhost:4222. Required for KV event streaming
DYN_EVENT_PLANE | Event plane backend for KV cache events   | N/A              | Set to nats to enable NATS-based event publishing
PYTHONHASHSEED  | Deterministic hash seed for KV event IDs  | N/A              | Set to 0 for reproducible KV event ID generation
DYN_SYSTEM_PORT | Port for the worker system status server  | System-dependent | Used for health checks and metrics. Set per worker to avoid conflicts (e.g., 8081, 8082)

Parameter Tuning Guidelines

For TTFT Optimization (Low Latency):

  • Increase DYN_KV_OVERLAP_SCORE_WEIGHT to 1.5-2.0 (prioritize cache hits).
  • Lower busy thresholds: DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD=0.70, DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD=500.
  • Keep DYN_ROUTER_TEMPERATURE low (0.0-0.1) for deterministic selection.

For Throughput Optimization (High Load):

  • Lower DYN_KV_OVERLAP_SCORE_WEIGHT to 0.5-1.0 (balance cache hits with load).
  • Higher busy thresholds: DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD=0.95, DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD=2000.
  • Increase DYN_ROUTER_TEMPERATURE to 0.3-0.5 for better load distribution.

When KV Events Are Unavailable:

  • Set DYN_KV_EVENTS=false to use prediction-based routing.
  • Router tracks state via routing decisions with TTL-based expiration (default 120s).
  • Less accurate than real-time events but works without backend KV event support.
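The prediction-based mode described above can be sketched as a TTL-expiring map from (worker, prefix) pairs to routing decisions. This is a toy model of the behavior, not Dynamo's actual data structure; the class and names are hypothetical:

```python
import time

class PredictedCache:
    """Toy sketch of prediction-based routing state (DYN_KV_EVENTS=false):
    the router remembers where it sent each prefix and forgets the entry
    once the TTL (default 120s) expires."""

    def __init__(self, ttl_seconds=120.0):
        self.ttl = ttl_seconds
        self.entries = {}  # (worker, prefix_hash) -> expiry timestamp

    def record_route(self, worker, prefix_hash, now=None):
        """Record a routing decision; the block is assumed cached until expiry."""
        now = time.monotonic() if now is None else now
        self.entries[(worker, prefix_hash)] = now + self.ttl

    def cached_on(self, worker, prefix_hash, now=None):
        """Predict whether a prefix is still cached on a worker."""
        now = time.monotonic() if now is None else now
        expiry = self.entries.get((worker, prefix_hash))
        return expiry is not None and now < expiry

cache = PredictedCache(ttl_seconds=120.0)
cache.record_route("worker-1", "prefix-abc", now=0.0)
print(cache.cached_on("worker-1", "prefix-abc", now=60.0))    # True: within TTL
print(cache.cached_on("worker-1", "prefix-abc", now=180.0))   # False: entry expired
```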

Set Up Deployment Environment

Both aggregated and disaggregated serving deployments use the same KV routing configuration. This section prepares the container environment and creates a shared configuration file for all deployment scripts.

  1. Export your Hugging Face token. Replace YOUR_HF_TOKEN with your actual token.

    console
    $ export HF_TOKEN=YOUR_HF_TOKEN
    
  2. Run the vLLM container with GPU access and workspace mounting. Use the image tag that matches your CUDA version.

    console
    $ ./container/run.sh -it --framework VLLM --mount-workspace --image nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0-cuda13 -e HF_TOKEN=$HF_TOKEN
    

    The command starts an interactive container session with GPU support and passes the Hugging Face token to the container.

  3. Inside the container, create the common environment configuration file.

    console
    $ cat << 'EOF' > ~/kv_routing.env
    # Infrastructure endpoints
    export ETCD_ENDPOINTS=http://localhost:2379
    export NATS_SERVER=nats://localhost:4222
    export DYN_EVENT_PLANE=nats
    
    # Set deterministic hash for KV event IDs
    export PYTHONHASHSEED=0
    
    # KV Router Configuration
    export DYN_ROUTER_MODE=kv
    export DYN_ROUTER_TEMPERATURE=0.1
    export DYN_KV_OVERLAP_SCORE_WEIGHT=1.5
    export DYN_KV_CACHE_BLOCK_SIZE=16
    export DYN_KV_EVENTS=true
    
    # Optional: Advanced routing configuration
    export DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD=0.85
    export DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD=10000
    
    echo "Environment configuration loaded:"
    echo "  etcd: $ETCD_ENDPOINTS"
    echo "  NATS: $NATS_SERVER"
    echo "  Router Mode: $DYN_ROUTER_MODE"
    EOF
    

    The common environment file contains shared configuration variables for all deployment scripts, including etcd endpoints, NATS server address, and KV routing parameters.

Configure KV Routing for Aggregated Serving

Aggregated serving combines prefill and decode phases on a single worker. KV-aware routing optimizes which worker receives each request based on cached KV blocks, improving performance even when workers handle both phases. The deployment script configures multiple vLLM workers with KV event publishing and a frontend with KV-aware routing enabled.

Deploy Aggregated Serving with KV Routing

The deployment script launches multiple vLLM workers with KV event publishing enabled and a frontend configured for KV-aware routing. The script automatically handles process cleanup and provides routing configuration feedback.

  1. Inside the container, create the aggregated serving deployment script.

    console
    $ cat << 'EOF' > ~/nemotron_agg_kv_routing.sh
    #!/bin/bash
    source ~/kv_routing.env
    
    MODEL="${1:-nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1}"
    
    pkill -f "dynamo.frontend"; pkill -f "dynamo.vllm"; sleep 2
    set -e
    trap 'kill 0' EXIT
    
    python -m dynamo.frontend --router-reset-states &
    sleep 10
    
    # Worker 1 - GPU 0 (ZMQ port 5557)
    DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT1:-8081} \
    CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
        --model "$MODEL" \
        --trust-remote-code \
        --connector none \
        --enforce-eager \
        --enable-prefix-caching \
        --kv-events-config '{"enable_kv_cache_events":true,"endpoint":"tcp://*:5557"}' &
    
    # Worker 2 - GPU 1 (ZMQ port 5558)
    DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
    CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
        --model "$MODEL" \
        --trust-remote-code \
        --connector none \
        --enforce-eager \
        --enable-prefix-caching \
        --kv-events-config '{"enable_kv_cache_events":true,"endpoint":"tcp://*:5558"}' &
    
    sleep 30
    echo "Aggregated serving ready at http://localhost:8000 (Model: $MODEL)"
    wait
    EOF
    

    This script configures aggregated serving with KV-aware routing:

    • Accepts the model name as a command-line argument (defaults to NVIDIA Nemotron Nano 4B).
    • Sources the environment configuration created earlier (includes DYN_EVENT_PLANE=nats).
    • Uses --connector none for aggregated serving (prefill+decode on same worker).
    • Uses --enforce-eager for fast deployment without compilation.
    • Enables prefix caching with --enable-prefix-caching for KV cache reuse.
    • Enables KV event publishing with ZMQ endpoints (ports 5557, 5558) on both workers.
    • Configures system ports with DYN_SYSTEM_PORT to avoid conflicts (8081, 8082).
    • Uses --trust-remote-code to support NVIDIA Nemotron models.
    Note
    KV Event Flow Architecture: vLLM natively publishes KV cache events via ZMQ. Dynamo's bridge component automatically translates these events to NATS for router consumption:
    ┌─────────────────────┐
    │ vLLM Worker 1       │
    │ (ZMQ: tcp://*:5557) │───┐
    └─────────────────────┘   │                                           ┌─────────────────────┐
                              │    ┌──────────────┐    ┌──────────────┐   │ KV Router           │
                              ├───>│ Dynamo Bridge├───>│ NATS Server  ├──>│ (Builds Radix Tree) │
                              │    └──────────────┘    └──────────────┘   │ Global Cache View   │
    ┌─────────────────────┐   │                                           └─────────────────────┘
    │ vLLM Worker 2       │   │
    │ (ZMQ: tcp://*:5558) │───┘
    └─────────────────────┘
    This architecture provides the router with a unified view of cache state across all workers while allowing vLLM to use its native ZMQ event publishing. You'll see ZMQ endpoints in worker logs, but events flow through the Dynamo bridge to NATS for router consumption.
  2. Make the script executable.

    console
    $ chmod +x ~/nemotron_agg_kv_routing.sh
    
  3. Run the aggregated serving script.

    console
    $ ~/nemotron_agg_kv_routing.sh
    

    The script starts the frontend with KV routing enabled and launches two vLLM workers on GPUs 0 and 1. The frontend connects to NATS and begins consuming KV events from all registered workers. The router builds a prefix tree (radix tree) to track cached blocks across workers.

    To deploy a different model, pass the model name as an argument:

    console
    $ ~/nemotron_agg_kv_routing.sh "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"
    

Verify KV Routing

The router logs provide insights into worker selection and cost calculations. Verify that the router makes KV-aware decisions based on cached blocks.

  1. Open a new terminal session and send a test request to the frontend. Replace the model name if you deployed with a different model.

    console
    $ curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
        "messages": [
          {"role": "system", "content": "You are a helpful AI assistant."},
          {"role": "user", "content": "Explain the concept of KV cache in LLM inference."}
        ],
        "max_tokens": 100,
        "temperature": 0.7
      }'
    

    The output displays the model's response in JSON format with streaming tokens.

  2. Check the frontend logs for routing decision details. The logs display cost calculations showing the router's cost formula for each worker:

    Formula for worker_id=7587893399646425890 dp_rank=0 with 0 cached blocks: 5.188 = 1.5 * prefill_blocks + decode_blocks = 1.5 * 2.125 + 2.000
    Formula for worker_id=7587893399646425892 dp_rank=0 with 0 cached blocks: 5.188 = 1.5 * prefill_blocks + decode_blocks = 1.5 * 2.125 + 2.000
    Selected worker: worker_id=7587893399646425892 dp_rank=0, logit: 5.188, cached blocks: 0, tree size: 0, total blocks: 75379

The router calculates the cost for each worker using the formula cost = overlap_score_weight × prefill_blocks + decode_blocks. In this example, both workers have 0 cached blocks (cold start), equal prefill costs (2.125 blocks weighted by 1.5), and equal decode loads (2.000). Because the costs are identical (5.188), the router breaks the tie with temperature sampling; otherwise it selects the worker with the lowest cost.

  3. Send a second request with a similar prompt prefix.

    console
    $ curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
        "messages": [
          {"role": "system", "content": "You are a helpful AI assistant."},
          {"role": "user", "content": "Explain the concept of KV cache reuse in distributed LLM serving."}
        ],
        "max_tokens": 100,
        "temperature": 0.7
      }'
    
  4. Check the routing decision for the second request in the logs. If the router detects cached blocks from the previous request with a matching prompt prefix, the logs will show a reduced prefill cost:

    Formula for worker_id=7587893399646425890 dp_rank=0 with 0 cached blocks: 5.375 = 1.5 * prefill_blocks + decode_blocks = 1.5 * 2.250 + 2.000
    Formula for worker_id=7587893399646425892 dp_rank=0 with 1 cached blocks: 3.875 = 1.5 * prefill_blocks + decode_blocks = 1.5 * 1.250 + 2.000
    Selected worker: worker_id=7587893399646425892 dp_rank=0, logit: 3.875, cached blocks: 1, tree size: 8, total blocks: 75379

The router prioritizes worker 7587893399646425892, whose 1 cached block lowers its cost to 3.875 (versus 5.375 for the worker with no cached blocks). The cached block reduces the prefill computation needed (1.250 blocks instead of 2.250), demonstrating KV cache reuse and the resulting faster Time To First Token (TTFT) for requests with overlapping prompt prefixes.
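For ongoing monitoring, the "Selected worker" log lines shown above can be parsed to track the router's cache hit rate. This sketch assumes the log format shown in this guide, which may differ between Dynamo versions:

```python
import re

# Matches the "Selected worker" lines shown above (format is illustrative).
SELECTED = re.compile(
    r"Selected worker: worker_id=(?P<worker>\d+).*?"
    r"logit: (?P<cost>[\d.]+), cached blocks: (?P<cached>\d+)"
)

def cache_hit_stats(log_lines):
    """Fraction of routing decisions that landed on a worker holding
    at least one cached block for the request."""
    decisions = [m for line in log_lines if (m := SELECTED.search(line))]
    if not decisions:
        return 0.0
    hits = sum(1 for m in decisions if int(m.group("cached")) > 0)
    return hits / len(decisions)

logs = [
    "Selected worker: worker_id=7587893399646425892 dp_rank=0, logit: 5.188, cached blocks: 0, tree size: 0, total blocks: 75379",
    "Selected worker: worker_id=7587893399646425892 dp_rank=0, logit: 3.875, cached blocks: 1, tree size: 8, total blocks: 75379",
]
print(cache_hit_stats(logs))   # 0.5
```

A low hit rate on a workload with repetitive prefixes suggests the overlap weight is too low or KV events are not reaching the router.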

  5. Open a new terminal session on the host (outside the container) and monitor the KV event stream to verify event publishing.

    console
    $ docker run --rm -it --network container:deploy-nats-server-1 \
      natsio/nats-box nats stream info namespace-dynamo-component-backend-kv-events
    

    The output displays the JetStream statistics including message count and consumer information.

    Information for Stream namespace-dynamo-component-backend-kv-events
    ......
                    Subjects: namespace-dynamo-component-backend-kv-events.*
                    Replicas: 1
                    Storage: File
    ......
    State:
    ......
                    Messages: 8
                    Bytes: 2.3 KiB
            Active Consumers: 1
    ......

    The message count increases as workers publish KV cache events. Each message represents a KV cache state change, such as block creation or eviction.

Configure KV Routing for Disaggregated Serving

Disaggregated serving separates prefill and decode phases across different worker pools. The KV Router optimizes both prefill worker selection (based on cache hits) and decode worker assignment (based on load distribution), while coordinating KV cache transfer between phases via NIXL. The deployment script launches dedicated prefill and decode workers with appropriate batch sizes and KV event publishing enabled.

Deploy Disaggregated Serving with KV Routing

The deployment script configures the frontend with increased overlap weight for prefill optimization, launches prefill workers with low batch sizes for TTFT optimization, and decode workers with high batch sizes for throughput maximization. NIXL connector enables efficient KV cache transfer between prefill and decode workers.

  1. If the aggregated serving script is still running from the previous section, stop it by pressing Ctrl+C in the container terminal.

  2. Inside the same container, create the disaggregated serving deployment script.

    console
    $ cat << 'EOF' > ~/nemotron_disagg_kv_routing.sh
    #!/bin/bash
    source ~/kv_routing.env
    
    MODEL="${1:-nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1}"
    
    pkill -f "dynamo.frontend"; pkill -f "dynamo.vllm"; sleep 2
    set -e
    trap 'kill 0' EXIT
    
    python -m dynamo.frontend --router-reset-states &
    sleep 10
    
    # Decode Worker 1 - GPU 0
    CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
        --model "$MODEL" \
        --connector nixl \
        --is-decode-worker \
        --max-model-len 2048 \
        --enable-prefix-caching \
        --trust-remote-code &
    
    # Decode Worker 2 - GPU 1
    VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
    CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
        --model "$MODEL" \
        --connector nixl \
        --is-decode-worker \
        --max-model-len 2048 \
        --enable-prefix-caching \
        --trust-remote-code &
    
    # Prefill Worker 1 - GPU 2 (ZMQ port 5559)
    CUDA_VISIBLE_DEVICES=2 python3 -m dynamo.vllm \
        --model "$MODEL" \
        --connector nixl \
        --is-prefill-worker \
        --max-model-len 2048 \
        --enable-prefix-caching \
        --trust-remote-code \
        --kv-events-config '{"enable_kv_cache_events":true,"endpoint":"tcp://*:5559"}' &
    
    # Prefill Worker 2 - GPU 3 (ZMQ port 5560)
    VLLM_NIXL_SIDE_CHANNEL_PORT=20099 \
    CUDA_VISIBLE_DEVICES=3 python3 -m dynamo.vllm \
        --model "$MODEL" \
        --connector nixl \
        --is-prefill-worker \
        --max-model-len 2048 \
        --enable-prefix-caching \
        --trust-remote-code \
        --kv-events-config '{"enable_kv_cache_events":true,"endpoint":"tcp://*:5560"}' &
    
    sleep 30
    echo "Disaggregated serving ready at http://localhost:8000 (Model: $MODEL)"
    echo "Architecture: 2 Decode Workers (GPU 0-1) + 2 Prefill Workers (GPU 2-3)"
    wait
    EOF
    

    This script configures disaggregated serving with KV-aware routing and NIXL for KV cache transfer:

    • Accepts the model name as a command-line argument (defaults to NVIDIA Nemotron Nano 4B).
    • Sources the environment configuration created earlier (includes DYN_EVENT_PLANE=nats).
    • Uses --connector nixl to enable NIXL for KV cache transfer between prefill and decode workers.
    • Sets unique NIXL ports via VLLM_NIXL_SIDE_CHANNEL_PORT for workers 2 and 4 (20097, 20099).
    • Enables prefix caching on all workers with --enable-prefix-caching for KV cache reuse.
    • Configures decode workers on GPUs 0-1 to receive KV cache from prefill workers via NIXL.
    • Configures prefill workers on GPUs 2-3 with ZMQ endpoints (ports 5559, 5560) for KV event publishing.
    • Uses --trust-remote-code to support NVIDIA Nemotron models.
    • Uses --max-model-len 2048 for efficient memory usage across all workers.
    • Frontend automatically detects prefill and decode workers to enable disaggregated routing.
  3. Make the script executable.

    console
    $ chmod +x ~/nemotron_disagg_kv_routing.sh
    
  4. Run the disaggregated serving script.

    console
    $ ~/nemotron_disagg_kv_routing.sh
    

    The script starts the frontend, two decode workers on GPUs 0-1, and two prefill workers on GPUs 2-3. The frontend automatically detects prefill workers and creates an internal prefill router for cache-aware prefill worker selection. The main router handles decode worker assignment based on load distribution.

    To deploy a different model, pass the model name as an argument:

    console
    $ ~/nemotron_disagg_kv_routing.sh "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"
    

Verify Disaggregated KV Routing

Disaggregated routing involves two routing decisions: prefill worker selection based on cache overlap, and decode worker assignment based on load and available capacity. The script output shows both routing stages.

  1. Open a new terminal session and send a test request to verify the disaggregated flow. Replace the model name if you deployed with a different model.

    console
    $ curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
        "messages": [
          {"role": "system", "content": "You are a helpful AI assistant specialized in distributed systems."},
          {"role": "user", "content": "Explain how KV cache transfer works in disaggregated LLM serving using NIXL."}
        ],
        "max_tokens": 150,
        "temperature": 0.7
      }'
    
  2. Check the frontend logs for prefill routing decisions in the script output. The logs show prefill worker cost calculations with the higher overlap weight (1.5):

    INFO dynamo_llm::kv_router::scheduler: Formula for worker_id=7587893399646425930 dp_rank=0 with 0 cached blocks: 4.125 = 1.5 * prefill_blocks + decode_blocks = 1.5 * 2.750 + 0.000
    INFO dynamo_llm::kv_router::scheduler: Formula for worker_id=7587893399646425934 dp_rank=0 with 0 cached blocks: 4.125 = 1.5 * prefill_blocks + decode_blocks = 1.5 * 2.750 + 0.000
    INFO dynamo_llm::kv_router::scheduler: Selected worker: worker_id=7587893399646425930 dp_rank=0, logit: 4.125, cached blocks: 0, tree size: 0, total blocks: 75379

    The prefill router evaluates both prefill workers. On the first request (cold start), both workers have 0 cached blocks, resulting in equal costs (4.125). The cost formula applies the overlap weight of 1.5 to the prefill blocks (2.750), resulting in 4.125 = 1.5 * 2.750 + 0.000. The decode blocks component is 0.0 because prefill workers only handle the prefill phase.

  3. Check the decode routing decisions in the script output. After prefill completion, the decode router assigns the request to a decode worker:

    INFO dynamo_llm::kv_router::scheduler: Formula for worker_id=7587893399646425936 dp_rank=0 with 0 cached blocks: 2.000 = 0.0 * prefill_blocks + decode_blocks = 0.0 * 2.750 + 2.000
    INFO dynamo_llm::kv_router::scheduler: Selected worker: worker_id=7587893399646425936 dp_rank=0, logit: 2.000, cached blocks: 0, tree size: 0, total blocks: 75379

    The decode router selects a decode worker based on load. The prefill blocks component is 0.0 in the formula because decode workers don't compute prefills—they receive KV cache from prefill workers via NIXL transfer.

  4. Monitor NIXL transfer success in the worker logs displayed by the script. The logs confirm KV cache transfer:

    INFO nixl_connector._nixl_handshake: NIXL compatibility check passed (hash: cae818f9461cb277c305be4b8cd4aca3273e21a6ea98f7399d1a339560e1b3ea)

    NIXL successfully transfers KV cache blocks from prefill to decode workers using RDMA for low-latency, high-bandwidth data transfer.

  5. Send a second request with a similar prompt prefix to verify cache reuse.

    console
    $ curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
        "messages": [
          {"role": "system", "content": "You are a helpful AI assistant specialized in distributed systems."},
          {"role": "user", "content": "Explain how KV cache reuse works in disaggregated LLM serving."}
        ],
        "max_tokens": 150,
        "temperature": 0.7
      }'
    
  6. Check the prefill routing logs for the second request. The logs show improved cache hits:

    INFO dynamo_llm::kv_router::scheduler: Formula for worker_id=7587893399646425930 dp_rank=0 with 1 cached blocks: 2.250 = 1.5 * prefill_blocks + decode_blocks = 1.5 * 1.500 + 0.000
    INFO dynamo_llm::kv_router::scheduler: Formula for worker_id=7587893399646425934 dp_rank=0 with 0 cached blocks: 3.750 = 1.5 * prefill_blocks + decode_blocks = 1.5 * 2.500 + 0.000
    INFO dynamo_llm::kv_router::scheduler: Selected worker: worker_id=7587893399646425930 dp_rank=0, logit: 2.250, cached blocks: 1, tree size: 2, total blocks: 75379

    The prefill router now sees 1 cached block on worker 7587893399646425930, reducing its cost from 3.750 to 2.250. The router prioritizes this worker due to the lower cost from cache reuse, demonstrating KV-aware routing's ability to optimize TTFT by leveraging cached prompt prefixes.

Configure KV Routing with SGLang and TensorRT-LLM

The KV Router environment configuration is universal across all inference backends. The same environment variables used for vLLM deployments work identically with SGLang and TensorRT-LLM, requiring only backend-specific adjustments for block size and KV event publishing.

For complete deployment guides using these backends, see the SGLang and TensorRT-LLM deployment guides in the Dynamo documentation.

Universal Environment Variables

These environment variables work across all backends (vLLM, SGLang, TensorRT-LLM) without modification:

console
# Infrastructure endpoints
export ETCD_ENDPOINTS=http://localhost:2379
export NATS_SERVER=nats://localhost:4222
export DYN_EVENT_PLANE=nats
export PYTHONHASHSEED=0

# Router configuration
export DYN_ROUTER_MODE=kv
export DYN_ROUTER_TEMPERATURE=0.1
export DYN_KV_OVERLAP_SCORE_WEIGHT=1.5
export DYN_KV_EVENTS=true

# Load balancing thresholds
export DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD=0.85
export DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD=10000

Backend-Specific Configuration

While the router environment variables are universal, two aspects require backend-specific configuration:

Block Size Configuration

The DYN_KV_CACHE_BLOCK_SIZE must match your backend's actual block size configuration:

Backend        Typical Block Size            Configuration
SGLang         16                            export DYN_KV_CACHE_BLOCK_SIZE=16
TensorRT-LLM   64 or 128 (varies by model)   export DYN_KV_CACHE_BLOCK_SIZE=64
vLLM           16                            export DYN_KV_CACHE_BLOCK_SIZE=16
Important
Mismatched block sizes between the router and workers cause incorrect cost calculations and routing decisions. Always verify your backend's block size configuration.
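To see why a mismatch matters, consider a hedged numeric sketch (a hypothetical helper, not Dynamo code). A router configured for 64-token blocks in front of a worker that actually caches 16-token blocks undercounts the prefill work by a factor of four, skewing every cost comparison:

```python
def prefill_blocks(input_tokens: int, cached_tokens: int, block_size: int) -> float:
    # Prefill blocks = (Total input tokens - Cached tokens) / Block size
    return max(input_tokens - cached_tokens, 0) / block_size

# Worker really uses 16-token blocks; router misconfigured to assume 64.
actual = prefill_blocks(2048, cached_tokens=512, block_size=16)         # 96.0
misconfigured = prefill_blocks(2048, cached_tokens=512, block_size=64)  # 24.0
print(actual, misconfigured)
```

Because every worker's cost is computed in block units, a systematic 4x error like this makes cache overlap look far cheaper or more expensive than it really is.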

KV Event Publishing Configuration

Each backend uses different command-line parameters to enable KV event publishing to the router:

Backend        Parameter                      Example Value
SGLang         --kv-events-config             nats://localhost:4222
TensorRT-LLM   --publish-events-and-metrics   (flag only, no value)
vLLM           --kv-events-config             '{"enable_kv_cache_events":true,"endpoint":"tcp://*:5557"}'

SGLang publishes events using a NATS URL format:

--kv-events-config nats://localhost:4222

TensorRT-LLM uses a simple flag to enable event publishing:

--publish-events-and-metrics

vLLM uses a JSON configuration format with ZMQ endpoint specification:

--kv-events-config '{"enable_kv_cache_events":true,"endpoint":"tcp://*:5557"}'

Combine these parameters with the backend-specific deployment guides to enable KV-aware routing for SGLang or TensorRT-LLM deployments.

Tuning KV Router Parameters

The KV Router provides several parameters to optimize routing behavior for different workload characteristics. Proper tuning balances cache reuse (TTFT optimization) with load distribution (ITL optimization).

Overlap Score Weight

The --kv-overlap-score-weight parameter controls the router's preference for cache hits versus load balancing.

Default Value: 1.0

When to Adjust:

  • Increase (1.5-2.0) for prefill-heavy workloads:

    • Long input prompts (>2000 tokens)
    • Short output generation (<200 tokens)
    • Minimizing TTFT is critical
    console
    $ python3 -m dynamo.frontend \
      --router-mode kv \
      --kv-overlap-score-weight 1.5
    
  • Decrease (0.5-0.8) for decode-heavy workloads:

    • Short input prompts (<500 tokens)
    • Long output generation (>1000 tokens)
    • Maximizing throughput is critical
    console
    $ python3 -m dynamo.frontend \
      --router-mode kv \
      --kv-overlap-score-weight 0.7
    
  • Set to 0.0 to disable cache-aware routing and use pure load balancing:

    console
    $ python3 -m dynamo.frontend \
      --router-mode kv \
      --kv-overlap-score-weight 0.0
    

    This mode ignores cached blocks and routes based solely on worker load. Useful for workloads with no prefix overlap or for baseline performance comparisons.
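
The trade-off these settings control can be illustrated with a toy comparison (hypothetical values, not Dynamo internals): worker A has strong cache overlap but an active decode load, while worker B is idle with a cold cache. The overlap score weight decides which one wins:

```python
def logit(prefill_blocks: float, decode_blocks: float, weight: float) -> float:
    # Routing cost: weight * prefill_blocks + decode_blocks (lower is better)
    return weight * prefill_blocks + decode_blocks

# Worker A: 2 prefill blocks after cache reuse, 10 active decode blocks.
# Worker B: 8 prefill blocks (cold cache), idle.
for weight in (0.5, 1.0, 2.0):
    a = logit(2, 10, weight)
    b = logit(8, 0, weight)
    print(weight, "-> worker", "A" if a < b else "B")
```

At low weights the idle worker B wins (load dominates); at a weight of 2.0 the cached worker A wins despite its decode load, which is exactly the behavior you want for prefill-heavy, TTFT-sensitive workloads.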

Router Temperature

The --router-temperature parameter introduces randomness in worker selection to prevent saturation and improve load distribution.

Default Value: 0.0 (deterministic)

When to Adjust:

  • Temperature = 0.0 (default): Always select the lowest-cost worker.

    • Best for: Maximizing cache reuse, stable workloads.
    • Risk: May oversaturate workers with the best cache state.
  • Temperature = 0.2-0.5: Add slight randomness while still favoring low-cost workers.

    • Best for: Production workloads with variable request patterns.
    • Benefit: Better load distribution without sacrificing much cache efficiency.
    console
    $ python3 -m dynamo.frontend \
      --router-mode kv \
      --router-temperature 0.3
    
  • Temperature = 1.0+: More uniform distribution across workers.

    • Best for: Debugging, testing, or when cache reuse is less important.
    • Behavior: Approaches random selection at very high temperatures.
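
The effect of temperature can be sketched as sampling over negated costs. This is an illustrative model of the behavior described above, not Dynamo's implementation; the helper name and softmax formulation are assumptions:

```python
import math
import random

def select_worker(logits: list[float], temperature: float) -> int:
    """Pick a worker index; lower logit = lower routing cost."""
    if temperature == 0.0:
        # Deterministic: always select the cheapest worker.
        return min(range(len(logits)), key=lambda i: logits[i])
    # Softmax over negated costs: cheap workers are likelier, not guaranteed.
    weights = [math.exp(-l / temperature) for l in logits]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

print(select_worker([2.25, 3.75], temperature=0.0))  # always 0
print(select_worker([2.25, 3.75], temperature=0.3))  # usually 0, occasionally 1
```

As the temperature grows, the exponential weights flatten toward equality, which is why very high temperatures approach random selection.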

KV Events Configuration

The --no-kv-events flag disables real-time KV event tracking and switches to prediction-based routing.

Default: KV events enabled

Prediction-Based Routing (--no-kv-events):

When this flag is set, the router predicts cache state based on its own routing decisions instead of receiving events from workers.

console
$ python3 -m dynamo.frontend \
  --router-mode kv \
  --no-kv-events \
  --router-ttl 120 \
  --router-max-tree-size 1048576 \
  --router-prune-target-ratio 0.8

Additional Parameters for Prediction Mode:

  • --router-ttl 120: Expire cached blocks after 120 seconds (default: 120).
  • --router-max-tree-size 1048576: Maximum blocks before pruning (default: 2^20).
  • --router-prune-target-ratio 0.8: Prune to 80% of max size when threshold exceeded.

When to Use:

  • Backend doesn't support KV event publishing.
  • Lower infrastructure overhead (no NATS dependency).
  • Simple deployments where prediction accuracy is acceptable.

Backend Configuration for Prediction Mode:

When using --no-kv-events, configure backends to disable event publishing:

  • vLLM: Use --kv-events-config '{"enable_kv_cache_events": false}'.
  • SGLang: Do not use --kv-events-config.
  • TRT-LLM: Do not use --publish-events-and-metrics.
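
How the prediction-mode parameters interact can be sketched as a TTL-bounded, size-capped record of block hashes. This is an illustrative model only (class and method names are assumptions, not Dynamo code):

```python
class PredictedBlockTree:
    """Illustrative sketch of prediction-based tracking (--no-kv-events).
    Block hashes are recorded from the router's own routing decisions,
    expired after a TTL (--router-ttl), and pruned back toward a target
    ratio (--router-prune-target-ratio) when the tree exceeds its maximum
    size (--router-max-tree-size)."""

    def __init__(self, ttl: float = 120.0, max_size: int = 1 << 20,
                 prune_target_ratio: float = 0.8):
        self.ttl = ttl
        self.max_size = max_size
        self.target = int(max_size * prune_target_ratio)
        self.blocks: dict[int, float] = {}  # block hash -> last-seen time

    def record(self, block_hash: int, now: float) -> None:
        self.blocks[block_hash] = now
        if len(self.blocks) > self.max_size:
            # Drop the oldest entries until only `target` remain.
            for h in sorted(self.blocks, key=self.blocks.get)[: len(self.blocks) - self.target]:
                del self.blocks[h]

    def expire(self, now: float) -> None:
        # Forget blocks not touched within the TTL window.
        self.blocks = {h: t for h, t in self.blocks.items() if now - t < self.ttl}
```

Because the router never hears back from workers in this mode, the TTL and pruning bounds are what keep its predicted view from drifting unboundedly from reality.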

Active Block Busy Detection

The --active-decode-blocks-threshold parameter marks workers as busy when their KV cache utilization exceeds a threshold, preventing overload.

Default: Disabled (no threshold)

Configuration:

console
$ python3 -m dynamo.frontend \
  --router-mode kv \
  --active-decode-blocks-threshold 0.85 \
  --active-prefill-tokens-threshold 10000

Parameters:

  • --active-decode-blocks-threshold 0.85: Mark worker busy at 85% KV cache utilization.
  • --active-prefill-tokens-threshold 10000: Mark worker busy when prefill tokens exceed 10,000.

When Workers Are Busy:

Busy workers are excluded from routing decisions until their load drops below the threshold. This prevents request queueing and maintains consistent latency.
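
The exclusion rule can be sketched as a simple filter over worker state (an illustrative helper with hypothetical field names, not Dynamo code), using the thresholds configured above:

```python
def eligible_workers(workers, decode_blocks_threshold=0.85,
                     prefill_tokens_threshold=10_000):
    """Illustrative filter: drop workers over either busy threshold."""
    return [
        w for w in workers
        if w["active_decode_blocks"] / w["total_blocks"] < decode_blocks_threshold
        and w["active_prefill_tokens"] < prefill_tokens_threshold
    ]

workers = [
    {"id": 1, "active_decode_blocks": 60_000, "total_blocks": 75_379,
     "active_prefill_tokens": 2_000},   # ~80% utilization -> eligible
    {"id": 2, "active_decode_blocks": 70_000, "total_blocks": 75_379,
     "active_prefill_tokens": 500},     # ~93% utilization -> busy
]
print([w["id"] for w in eligible_workers(workers)])  # [1]
```

Only worker 1 survives the filter: worker 2 sits above the 85% decode-block threshold, so it receives no new requests until its load drops.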

Dynamic Threshold Updates:

Thresholds can be updated at runtime without restarting the frontend:

console
$ curl -X POST http://localhost:8000/busy_threshold \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
    "active_decode_blocks_threshold": 0.90,
    "active_prefill_tokens_threshold": 12000
  }'

The response confirms the updated thresholds:

{
  "model": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
  "active_decode_blocks_threshold": 0.90,
  "active_prefill_tokens_threshold": 12000
}

Router Replica Synchronization

The --router-replica-sync flag enables active block state synchronization between multiple router replicas for consistent load balancing.

Default: Disabled

When to Enable:

Launch multiple frontend replicas for fault tolerance and horizontal scaling:

console
# Router replica 1
$ python3 -m dynamo.frontend \
  --router-mode kv \
  --http-port 8000 \
  --router-replica-sync

# Router replica 2
$ python3 -m dynamo.frontend \
  --router-mode kv \
  --http-port 8001 \
  --router-replica-sync

How It Works:

  • Each replica maintains its own view of active blocks (ephemeral state).
  • Replicas exchange routing decisions via NATS core messaging.
  • Cached block state (persistent) syncs automatically via JetStream.
  • New replicas start with zero active blocks but converge quickly through sync.

Benefits:

  • Consistent routing decisions across replicas.
  • Better load distribution in multi-replica deployments.
  • Faster convergence for new replicas.

Monitor KV Router Performance

To monitor KV Router performance metrics including routing decisions, cache efficiency, and worker utilization, deploy the observability stack following the How to Enable Observability in NVIDIA Dynamo Inference Pipelines guide.

Add the KV Router environment variables from this guide to your deployment configuration, then use Grafana dashboards to visualize KV cache routing performance metrics and Prometheus to track TTFT improvements, cache hit rates, and worker selection patterns.

Conclusion

You have successfully configured KV-aware smart routing in NVIDIA Dynamo for distributed LLM inference. The KV Router optimizes worker selection by considering cached block overlap and worker load, achieving significant improvements in TTFT (up to 40% reduction) and throughput (25-30% increase) compared to basic load balancing strategies. You configured routing for both aggregated and disaggregated serving patterns, tuned parameters based on workload characteristics, and implemented monitoring for production deployments. For advanced deployment patterns including multi-region routing and custom routing logic, refer to the official NVIDIA Dynamo Router documentation.
