How to Deploy Inference Using NVIDIA Dynamo and TensorRT-LLM

NVIDIA Dynamo is a high-throughput inference orchestration framework that enables efficient deployment of large language models across distributed GPU infrastructure. The platform provides advanced resource management capabilities, including prefill-decode disaggregation and intelligent request routing, to maximize hardware utilization and minimize latency in production environments.
This guide outlines the deployment of NVIDIA Dynamo with TensorRT-LLM, an optimized inference engine that delivers exceptional performance through kernel fusion and memory optimization. It covers infrastructure configuration, container deployment, aggregated serving for unified workloads, and disaggregated serving that distributes prefill and decode operations across separate GPU workers.
Prerequisites
Before you begin, ensure you:
- Have access to a Linux server with NVIDIA GPUs installed and the NVIDIA Container Toolkit configured. Use a non-root user with sudo privileges. Aggregated serving requires at least 1 GPU; disaggregated serving requires at least 2 GPUs.
- Have Docker Engine and Docker Compose installed.
- Have a Hugging Face account and an access token for gated models like Llama.
Key Components
This deployment uses several components that work together to provide efficient LLM inference.
NVIDIA Dynamo serves as the orchestration layer, managing GPU resources, routing requests, and coordinating between different workers. The platform includes a frontend service that receives inference requests, a smart router that directs traffic based on KV cache awareness, and a GPU planner that dynamically adjusts resource allocation based on workload demands.
TensorRT-LLM acts as the inference backend, providing highly optimized model execution through kernel fusion, quantization support, and memory optimization. In disaggregated mode, it supports two worker types: prefill workers process incoming prompts and generate the initial tokens, while decode workers handle sequential token generation. The TensorRT-LLM backend integrates with Dynamo through metrics reporting and cache transceiver coordination.
etcd provides distributed service discovery, allowing Dynamo components to locate and communicate with each other across the cluster. It maintains a registry of active workers and their capabilities.
NATS handles message passing between components, particularly for KV cache events. Prefill workers publish KV cache information through NATS, enabling the router to make intelligent decisions about request placement.
UCX (Unified Communication X) manages efficient data transfer between GPUs during disaggregated serving. It enables prefill workers to transfer KV cache data to decode workers with optimized inter-GPU communication through the cache transceiver backend.
Clone the Dynamo Repository
The Dynamo repository contains deployment scripts, container utilities, and orchestration modules required to run inference workloads. Clone the repository to access the TensorRT-LLM specific configurations and container runtime scripts.
Clone the repository.

```console
$ git clone https://github.com/ai-dynamo/dynamo.git
```

Navigate to the repository directory.

```console
$ cd dynamo
```

Switch to the latest stable release.

```console
$ git checkout release/0.9.0
```

The command checks out the 0.9.0 stable release. Visit the Dynamo releases page to confirm the latest stable release version.
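If you prefer to confirm the newest release branch programmatically, the output of `git ls-remote --heads` can be parsed with a short helper. This is a convenience sketch; it assumes the `release/X.Y.Z` branch-naming scheme used by the checkout above.

```python
import re

def latest_release(ls_remote_output: str) -> str:
    """Pick the highest release/X.Y.Z branch from `git ls-remote --heads` output."""
    versions = re.findall(r"refs/heads/release/(\d+(?:\.\d+)*)", ls_remote_output)
    if not versions:
        raise ValueError("no release branches found")
    # Compare versions component by component so 0.10.0 sorts above 0.9.0.
    return max(versions, key=lambda v: tuple(map(int, v.split("."))))

# Example: feed it the output of
#   git ls-remote --heads https://github.com/ai-dynamo/dynamo.git
sample = "abc\trefs/heads/release/0.8.1\ndef\trefs/heads/release/0.9.0\n"
print(latest_release(sample))  # 0.9.0
```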
Start Infrastructure Services
Dynamo requires etcd for worker registry and NATS for KV cache event propagation between workers. The Docker Compose configuration launches both services with exposed ports (etcd: 2379-2380, NATS: 4222, 6222, 8222). These services must run continuously for request coordination.
Start the infrastructure services.

```console
$ docker compose -f deploy/docker-compose.yml up -d
```

Verify the services are running.

```console
$ docker compose -f deploy/docker-compose.yml ps
```

The output displays the running etcd and NATS containers.

```
NAME                   IMAGE                      COMMAND                  SERVICE       CREATED          STATUS          PORTS
deploy-etcd-server-1   bitnamilegacy/etcd:3.6.1   "/opt/bitnami/script…"   etcd-server   58 seconds ago   Up 58 seconds   0.0.0.0:2379-2380->2379-2380/tcp, [::]:2379-2380->2379-2380/tcp
deploy-nats-server-1   nats:2.11.4                "/nats-server -c /et…"   nats-server   58 seconds ago   Up 58 seconds   0.0.0.0:4222->4222/tcp, [::]:4222->4222/tcp, 0.0.0.0:6222->6222/tcp, [::]:6222->6222/tcp, 0.0.0.0:8222->8222/tcp, [::]:8222->8222/tcp
```
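Beyond checking that the containers are up, you can probe the services' own health endpoints. The sketch below assumes the default ports from the Compose file above, etcd's `/health` endpoint on 2379, and the `/healthz` endpoint on NATS's 8222 monitoring port; adjust if your deployment differs.

```python
import json
import urllib.request

ENDPOINTS = {
    "etcd": "http://localhost:2379/health",
    "nats": "http://localhost:8222/healthz",
}

def fetch(url: str) -> str:
    """Fetch a URL, raising OSError (or a subclass) when unreachable."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode()

def check_services(fetch=fetch, endpoints=ENDPOINTS) -> dict:
    """Return {service: True/False} by probing each health endpoint."""
    status = {}
    for name, url in endpoints.items():
        try:
            fetch(url)
            status[name] = True
        except OSError:
            status[name] = False
    return status

if __name__ == "__main__":
    print(json.dumps(check_services(), indent=2))
```

Both services must report healthy before you start any workers, since worker registration and KV cache events depend on them.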
Pull Container Image
TensorRT-LLM containers include pre-compiled kernels and optimizations for specific GPU architectures. The container runtime requires matching CUDA libraries and GPU compute capabilities to achieve optimal performance.
Pull the TensorRT-LLM container image from NGC.

```console
$ docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0
```

Visit the NVIDIA NGC Catalog to view all available image tags.

Note: (Optional) Build the container from source instead of pulling the pre-built image.

```console
$ ./container/build.sh --framework TRTLLM
```

The build process creates an image named `dynamo:latest-trtllm`. If you prefer using this locally built image, replace `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0` with `dynamo:latest-trtllm` in all subsequent commands.
Configure Hugging Face Cache Permissions
The container runs as UID 1000 and requires write access to the Hugging Face cache directory for model downloads. Incorrect permissions prevent the container from accessing cached model weights, causing worker initialization failures.
Create the cache directory if it does not exist.

```console
$ mkdir -p container/.cache/huggingface
```

Set ownership to the container user (UID 1000).

```console
$ sudo chown -R 1000:1000 container/.cache/huggingface
```

Set appropriate permissions.

```console
$ sudo chmod -R 775 container/.cache/huggingface
```
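To confirm the permissions took effect before launching the container, a quick check can verify both the owner and the group-write bit. This is a convenience sketch; UID 1000 is the container user noted above.

```python
import os

def writable_by_uid(path: str, uid: int = 1000) -> bool:
    """Check that `path` is owned by `uid` and group-writable (mode 775 set above)."""
    st = os.stat(path)
    group_writable = bool(st.st_mode & 0o020)
    return st.st_uid == uid and group_writable

if __name__ == "__main__":
    cache = "container/.cache/huggingface"
    if os.path.isdir(cache):
        print(f"{cache} ready for UID 1000: {writable_by_uid(cache)}")
    else:
        print(f"{cache} does not exist; run the mkdir step first")
```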
Deploy Aggregated Serving
Aggregated serving combines prefill and decode phases on a single worker, eliminating inter-GPU cache transfers and minimizing latency. This architecture suits single-GPU deployments or scenarios where response time takes priority over maximum throughput.
Export your Hugging Face token to avoid rate limitations when downloading large models. Replace `YOUR_HF_TOKEN` with your actual token.

```console
$ export HF_TOKEN=YOUR_HF_TOKEN
```

Run the TensorRT-LLM container with GPU access and workspace mounting.

```console
$ ./container/run.sh -it --framework TRTLLM --mount-workspace --image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0 -e HF_TOKEN=$HF_TOKEN
```
The command starts an interactive container session with GPU support and passes the Hugging Face token to the container.
Inside the container, create a custom launch script for aggregated serving with the NVIDIA Nemotron model.

```console
$ cat << 'EOF' > ~/nemotron_agg.sh
#!/bin/bash
set -e
trap 'echo Cleaning up...; kill 0' EXIT

# Model configuration - use command line argument or default
MODEL="${1:-nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1}"

# Create engine configuration file
CONFIG_FILE="/tmp/engine_config.yaml"
cat <<EOCONFIG > "$CONFIG_FILE"
tensor_parallel_size: 1
moe_expert_parallel_size: 1
enable_attention_dp: false
max_num_tokens: 8192
max_batch_size: 16
trust_remote_code: true
backend: pytorch
enable_chunked_prefill: true
kv_cache_config:
  free_gpu_memory_fraction: 0.85
cuda_graph_config:
  max_batch_size: 16
EOCONFIG

echo "Starting Dynamo Frontend..."
python3 -m dynamo.frontend &

echo "Starting TensorRT-LLM Worker with model: $MODEL"
python3 -m dynamo.trtllm \
  --model-path "$MODEL" \
  --served-model-name "$MODEL" \
  --modality text \
  --extra-engine-args "$CONFIG_FILE"
EOF
```
Make the script executable.
```console
$ chmod +x ~/nemotron_agg.sh
```
Run the aggregated serving script.

```console
$ ~/nemotron_agg.sh
```

The script starts the frontend service on port `8000` and a TensorRT-LLM worker that loads the specified model (defaults to NVIDIA Nemotron Nano 4B).

To deploy the larger NVIDIA Nemotron Super 49B model instead, pass the model name as an argument:

```console
$ ~/nemotron_agg.sh "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"
```

Note: The 49B model requires high-memory GPUs such as B200 or GB200 class devices. Ensure sufficient VRAM and consider tensor parallelism for production deployments.

Open a new terminal session on your server (outside the container).
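The memory requirement in the note above follows from simple arithmetic: model weights alone need roughly parameters × bytes-per-parameter of VRAM, before accounting for KV cache and activations. A back-of-envelope estimate, not a measured figure:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM for model weights alone (FP16/BF16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# The Nano 4B fits comfortably on a single modern GPU...
print(weight_memory_gb(4))   # 8.0 (GB in FP16)
# ...while the Super 49B needs ~98 GB for weights alone, hence B200-class
# devices or tensor parallelism across several smaller GPUs.
print(weight_memory_gb(49))  # 98.0 (GB in FP16)
```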
Test with a chat completion request.
```console
$ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
    "messages": [{"role": "user", "content": "Hello! Tell me about AI."}],
    "max_tokens": 100
  }'
```
The output displays the model's chat response in JSON format.
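Because the frontend exposes an OpenAI-compatible API, any standard HTTP client works as well as curl. A minimal Python client is sketched below; the endpoint, model name, and request shape match the curl example above.

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str,
         model: str = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
         url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST a chat completion request and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(model, prompt)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(chat("Hello! Tell me about AI."))
    except OSError as exc:
        print(f"frontend not reachable: {exc}")
```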
Deploy Disaggregated Serving
Disaggregated serving assigns prefill and decode phases to separate GPU workers, enabling independent scaling of each phase. Prefill workers process prompts and transfer KV cache data to decode workers via UCX, maximizing throughput by separating prompt processing from token generation.
Exit the container if you are still inside from the previous section. Press `Ctrl+C` to terminate the running process, then press `Ctrl+D` to exit the container.

Export your Hugging Face token. Replace `YOUR_HF_TOKEN` with your actual token.

```console
$ export HF_TOKEN=YOUR_HF_TOKEN
```

Run the container.

```console
$ ./container/run.sh -it --framework TRTLLM --mount-workspace --image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0 -e HF_TOKEN=$HF_TOKEN
```
Inside the container, create a custom launch script for disaggregated serving with the NVIDIA Nemotron model.
```console
$ cat << 'EOF' > ~/nemotron_disagg.sh
#!/bin/bash
# Kill any existing processes
pkill -f "dynamo.frontend"
pkill -f "dynamo.trtllm"
sleep 2

# Model configuration - use command line argument or default
MODEL="${1:-nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1}"

# Create prefill engine configuration file
CONFIG_FILE_PREFILL="/tmp/prefill_config.yaml"
cat <<EOCONFIGPREFILL > "$CONFIG_FILE_PREFILL"
tensor_parallel_size: 1
moe_expert_parallel_size: 1
enable_attention_dp: false
max_num_tokens: 8192
trust_remote_code: true
backend: pytorch
enable_chunked_prefill: true
disable_overlap_scheduler: true
cuda_graph_config:
  max_batch_size: 16
kv_cache_config:
  free_gpu_memory_fraction: 0.85
cache_transceiver_config:
  backend: DEFAULT
EOCONFIGPREFILL

# Create decode engine configuration file
CONFIG_FILE_DECODE="/tmp/decode_config.yaml"
cat <<EOCONFIGDECODE > "$CONFIG_FILE_DECODE"
tensor_parallel_size: 1
moe_expert_parallel_size: 1
enable_attention_dp: false
max_num_tokens: 8192
trust_remote_code: true
backend: pytorch
enable_chunked_prefill: true
disable_overlap_scheduler: false
cuda_graph_config:
  max_batch_size: 16
kv_cache_config:
  free_gpu_memory_fraction: 0.85
cache_transceiver_config:
  backend: DEFAULT
EOCONFIGDECODE

echo "Starting Dynamo Frontend..."
python3 -m dynamo.frontend &

echo "Starting Prefill Worker with model: $MODEL"
# Prefill Worker - Uses GPU 0
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.trtllm \
  --model-path "$MODEL" \
  --served-model-name "$MODEL" \
  --extra-engine-args "$CONFIG_FILE_PREFILL" \
  --modality text \
  --disaggregation-mode prefill &

echo "Starting Decode Worker with model: $MODEL"
# Decode Worker - Uses GPU 1
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.trtllm \
  --model-path "$MODEL" \
  --served-model-name "$MODEL" \
  --extra-engine-args "$CONFIG_FILE_DECODE" \
  --modality text \
  --disaggregation-mode decode

# Keep script running
wait
EOF
```
Make the script executable.
```console
$ chmod +x ~/nemotron_disagg.sh
```

Run the disaggregated serving script.

```console
$ ~/nemotron_disagg.sh
```

The script starts the frontend service on port `8000`, a prefill worker on GPU `0`, and a decode worker on GPU `1` with the specified model (defaults to NVIDIA Nemotron Nano 4B).

To deploy the larger NVIDIA Nemotron Super 49B model instead, pass the model name as an argument:

```console
$ ~/nemotron_disagg.sh "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"
```

Note: The 49B model requires high-memory GPUs such as B200 or GB200 class devices. Ensure sufficient VRAM and consider tensor parallelism for production deployments.

Open a new terminal session on your server (outside the container).
Test with multiple sequential requests to observe worker distribution.
```console
$ for i in {1..5}; do
    echo "Request $i:"
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d "{
        \"model\": \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\",
        \"messages\": [{\"role\": \"user\", \"content\": \"Test request $i\"}],
        \"max_tokens\": 10
      }" | jq '.id'
    sleep 1
  done
```
Each request returns a unique ID, and the logs inside the container show which workers process each request.
Test with concurrent requests to verify load distribution.
```console
$ for i in {1..10}; do
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d "{
        \"model\": \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\",
        \"messages\": [{\"role\": \"user\", \"content\": \"Concurrent test $i\"}],
        \"max_tokens\": 20
      }" &
  done
  wait
  echo "All requests completed"
```
Dynamo's router distributes the requests across the prefill and decode workers.
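The shell loop above can also be written as a small concurrent load test in Python. This is a sketch, not a benchmark harness; the send function is injectable so the dispatch logic can be exercised without a live server, and the endpoint and model name are the ones used throughout this guide.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"

def send(prompt: str) -> str:
    """POST one chat completion request and return the response ID."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 20,
    }).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

def run_load_test(n: int, send=send) -> list:
    """Fire n requests concurrently and collect the response IDs in order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(send, (f"Concurrent test {i}" for i in range(1, n + 1))))

if __name__ == "__main__":
    try:
        ids = run_load_test(10)
        print(f"{len(ids)} requests completed: {ids}")
    except OSError as exc:
        print(f"frontend not reachable: {exc}")
```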
Conclusion
You have successfully deployed inference workloads using NVIDIA Dynamo with TensorRT-LLM. The aggregated serving configuration provides a simple, single-GPU deployment suitable for low-latency applications, while the disaggregated serving configuration optimizes throughput by separating prefill and decode phases across multiple GPUs. Dynamo's intelligent routing and resource management maximize GPU utilization and token generation efficiency. For more advanced configurations, including KV-aware routing, multimodal support, and Kubernetes deployments, refer to the official NVIDIA Dynamo documentation.