Comprehensive inference cookbook for running large language models on NVIDIA HGX B200 and AMD Instinct GPUs using vLLM.
Get started with NVIDIA HGX B200 GPUs, including hardware overview, environment setup, and first model deployment.
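A minimal sketch of what such a first deployment looks like with vLLM's offline Python API; the model name and sampling values here are illustrative assumptions, not the chapter's recommended configuration:

```python
# Minimal first deployment with vLLM's offline API.
# Model name and sampling values are illustrative assumptions.
from vllm import LLM, SamplingParams

# Load a model onto a single B200; vLLM picks sensible defaults
# for dtype and GPU memory utilization.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["What is speculative decoding?"], params)

for out in outputs:
    print(out.outputs[0].text)
```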
Deploy leading AI models including Nemotron, DeepSeek, GLM, and MiniMax on NVIDIA HGX B200 GPUs with optimized inference configurations.
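As a rough sketch of what deploying one of these large models looks like in vLLM's Python API, assuming an 8-GPU HGX B200 node (the model name, parallelism degree, and context cap are assumptions; the per-model chapters give the actual configurations):

```python
# Sketch: serving a large model across all 8 GPUs of an HGX B200 node.
# Model name, TP degree, and context length are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed example model
    tensor_parallel_size=8,           # shard across the 8 B200 GPUs
    max_model_len=32768,              # cap context to bound KV cache size
    trust_remote_code=True,           # many large models ship custom code
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```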
Explore NVIDIA Dynamo’s architecture for disaggregated LLM inference, including routing, KV cache tiering, and optimized deployment with vLLM.
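Dynamo's own APIs are covered in that chapter; as a purely illustrative sketch of the KV-cache-aware routing idea (every name below is hypothetical, not Dynamo's API), a router can steer each request to the worker whose cache shares the longest token prefix with it:

```python
# Hypothetical illustration of KV-cache-aware routing (NOT Dynamo's API):
# send a request to the worker whose cache shares the longest prefix with it.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)

def prefix_overlap(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens: tuple[int, ...], workers: list[Worker]) -> Worker:
    # Pick the worker that can reuse the most prefill work from its KV cache.
    def best_overlap(w: Worker) -> int:
        return max((prefix_overlap(request_tokens, p) for p in w.cached_prefixes), default=0)
    return max(workers, key=best_overlap)

workers = [
    Worker("prefill-0", [(1, 2, 3, 4)]),
    Worker("prefill-1", [(1, 2, 9)]),
]
print(route((1, 2, 3, 5), workers).name)  # -> prefill-0 (overlap 3 vs 2)
```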
Optimize LLM inference on NVIDIA HGX B200 GPUs with KV cache management, quantization, kernel tuning, and concurrency tuning.
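A sketch of how several of these knobs surface in vLLM's engine arguments; the specific values are illustrative starting points, not tuned recommendations:

```python
# Sketch: common vLLM optimization knobs; values are illustrative only.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed example model
    kv_cache_dtype="fp8",          # halve KV cache footprint vs fp16
    gpu_memory_utilization=0.92,   # leave headroom for activations
    max_num_seqs=256,              # upper bound on concurrent sequences
    enable_prefix_caching=True,    # reuse KV blocks across shared prefixes
)
```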
Benchmark methodology and performance results for LLM inference workloads on NVIDIA HGX B200 GPUs.
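The chapter documents the full methodology; as a minimal sketch of the kind of measurement involved (the endpoint URL, model name, and prompt are assumptions, and real runs add warmup, percentiles, and per-token accounting), one can time requests against vLLM's OpenAI-compatible server:

```python
# Minimal latency measurement against a vLLM OpenAI-compatible endpoint.
# URL, model name, and prompt are assumptions.
import json, time, urllib.request

URL = "http://localhost:8000/v1/completions"  # assumed endpoint

def timed_completion(prompt: str, max_tokens: int = 128) -> float:
    body = json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

latencies = [timed_completion("Explain KV caching.") for _ in range(5)]
print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")
```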
Common issues encountered when running vLLM on NVIDIA HGX B200 GPUs and their solutions.
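One common class of fix, sketched here with assumed values: engine startup failing with out-of-memory often yields to a smaller context cap and a lower memory-utilization target:

```python
# Sketch: a common mitigation for out-of-memory at engine startup is to
# cap the context length and reduce reserved GPU memory. Values assumed.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model
    max_model_len=8192,            # smaller max context -> smaller KV cache
    gpu_memory_utilization=0.85,   # reserve less, leaving room for spikes
    enforce_eager=True,            # skip CUDA graph capture to save memory
)
```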
Guidelines for deploying vLLM on NVIDIA HGX B200 instances in production.
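For instance, production clients should tolerate transient failures; a sketch of client-side retry with exponential backoff against the serving endpoint (the endpoint and model are assumptions, and the attempt counts and delays should be tuned to your SLOs):

```python
# Sketch: client-side retry with exponential backoff for production traffic.
# Endpoint and model are assumptions.
import json, time, urllib.error, urllib.request

def complete_with_retries(prompt: str, attempts: int = 4) -> str:
    body = json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model
        "prompt": prompt,
        "max_tokens": 64,
    }).encode()
    for i in range(attempts):
        try:
            req = urllib.request.Request(
                "http://localhost:8000/v1/completions", data=body,
                headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=60) as resp:
                return json.load(resp)["choices"][0]["text"]
        except (urllib.error.URLError, TimeoutError):
            if i == attempts - 1:
                raise
            time.sleep(2 ** i)  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```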
Get started running vLLM on AMD Instinct GPUs with hardware requirements, environment setup, and your first model deployment.
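The vLLM Python API is the same on ROCm as on CUDA; a sketch of a quick sanity check that the Instinct GPU is visible, followed by a first generation (the model name is an assumption):

```python
# Sketch: verify the Instinct GPU is visible, then run a first generation.
# On ROCm builds of PyTorch, Instinct GPUs surface through the torch.cuda API.
import torch
from vllm import LLM, SamplingParams

assert torch.cuda.is_available(), "no GPU visible - check the ROCm install"
print("device:", torch.cuda.get_device_name(0))

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model
out = llm.generate(["Hello from ROCm"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```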
Deploy large AI models including DeepSeek V3.2, Llama 3.1, Qwen3-VL, and Kimi-K2.5 on AMD Instinct GPUs.
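For the vision-language case, a sketch of image input through vLLM's multimodal interface; the model name and image-placeholder prompt template are assumptions, so consult the model chapter for the exact chat template:

```python
# Sketch: vision-language inference via vLLM's multi_modal_data input.
# Model name and prompt template are assumptions.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-8B-Instruct", trust_remote_code=True)  # assumed
image = Image.open("example.jpg")  # assumed local file

outputs = llm.generate(
    {
        "prompt": "<|vision_start|><|image_pad|><|vision_end|>Describe this image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```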
Improve LLM inference performance on AMD Instinct GPUs with FP8 quantization, KV cache optimization, and concurrency tuning.
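To make the FP8 KV cache benefit concrete, a back-of-envelope calculation; the model dimensions are Llama-3.1-70B's published ones (80 layers, 8 KV heads, head dim 128), while the memory budget and context length are assumptions:

```python
# Back-of-envelope KV cache sizing: why FP8 doubles achievable concurrency.
# Dimensions are Llama-3.1-70B's; the KV budget and context are assumptions.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
KV_BUDGET = 100 * 2**30   # bytes set aside for KV cache (assumed)
CONTEXT = 8192            # tokens per sequence (assumed)

def bytes_per_token(elem_bytes: int) -> int:
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * elem_bytes  # 2 = K and V

for name, elem in [("fp16", 2), ("fp8", 1)]:
    per_tok = bytes_per_token(elem)
    seqs = KV_BUDGET // (per_tok * CONTEXT)
    print(f"{name}: {per_tok/1024:.0f} KiB/token -> ~{seqs} concurrent 8K sequences")
# fp16: 320 KiB/token -> ~40 sequences; fp8: 160 KiB/token -> ~80
```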
Detailed benchmarking of DeepSeek, Llama, Qwen3-VL, and Kimi models on AMD Instinct MI325X GPUs with stress and validation testing.
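As a crude sketch of the stress-testing idea, fanning concurrent clients at the serving endpoint and checking the success rate (the endpoint, model, and load shape are assumptions; the chapter's stress and validation runs are far more systematic):

```python
# Sketch: a crude concurrency stress test against the serving endpoint.
# Endpoint, model, and load shape are assumptions.
import json, urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # assumed endpoint

def one_request(i: int) -> int:
    body = json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model
        "prompt": f"Request {i}: summarize the theory of relativity.",
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status

with ThreadPoolExecutor(max_workers=64) as pool:  # 64 concurrent clients
    statuses = list(pool.map(one_request, range(256)))
print("success rate:", statuses.count(200) / len(statuses))
```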
Common issues and verified solutions for vLLM on AMD Instinct GPUs.
Deploy vLLM with health checks, monitoring, and resilience on AMD Instinct GPUs.
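A sketch of a liveness probe loop against the `/health` route that vLLM's OpenAI-compatible server exposes; the URL and polling cadence are assumptions, and in practice this logic belongs in your orchestrator (Kubernetes probes, systemd watchdog) rather than an ad hoc script:

```python
# Sketch: a liveness probe loop against vLLM's /health endpoint.
# URL and polling cadence are assumptions.
import time, urllib.error, urllib.request

HEALTH_URL = "http://localhost:8000/health"  # vLLM OpenAI server health route

def is_healthy(timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

while True:
    print("healthy" if is_healthy() else "UNHEALTHY - consider restart/failover")
    time.sleep(15)  # assumed polling interval
```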