Comprehensive inference cookbook for running large language models on NVIDIA HGX B200 and AMD Instinct GPUs using vLLM.
Get started with NVIDIA HGX B200 GPUs, including hardware overview, environment setup, and first model deployment.
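A minimal sketch of what such a first deployment looks like with vLLM's offline Python API; the model name and sampling values here are illustrative assumptions, not the chapter's recommended configuration:

```python
# Minimal first deployment with vLLM's offline API.
# Model name and sampling values are illustrative assumptions.
from vllm import LLM, SamplingParams

# Load a model onto a single B200; vLLM picks sensible defaults
# for dtype and GPU memory utilization.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
outputs = llm.generate(["What is speculative decoding?"], params)

for out in outputs:
    print(out.outputs[0].text)
```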
Deploy leading AI models including Nemotron, DeepSeek, GLM, and MiniMax on NVIDIA HGX B200 GPUs with optimized inference configurations.
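As a rough sketch of what deploying one of these large models looks like in vLLM's Python API, assuming an 8-GPU HGX B200 node (the model name, parallelism degree, and context cap are assumptions; the per-model chapters give the actual configurations):

```python
# Sketch: serving a large model across all 8 GPUs of an HGX B200 node.
# Model name, TP degree, and context length are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed example model
    tensor_parallel_size=8,           # shard across the 8 B200 GPUs
    max_model_len=32768,              # cap context to bound KV cache size
    trust_remote_code=True,           # many large models ship custom code
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```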
Explore NVIDIA Dynamo’s architecture for disaggregated LLM inference, including routing, KV cache tiering, and optimized deployment with vLLM.
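Dynamo's own APIs are covered in that chapter; as a purely illustrative sketch of the KV-cache-aware routing idea (every name below is hypothetical, not Dynamo's API), a router can steer each request to the worker whose cache shares the longest token prefix with it:

```python
# Hypothetical illustration of KV-cache-aware routing (NOT Dynamo's API):
# send a request to the worker whose cache shares the longest prefix with it.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)

def prefix_overlap(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens: tuple[int, ...], workers: list[Worker]) -> Worker:
    # Pick the worker that can reuse the most prefill work from its KV cache.
    def best_overlap(w: Worker) -> int:
        return max((prefix_overlap(request_tokens, p) for p in w.cached_prefixes), default=0)
    return max(workers, key=best_overlap)

workers = [
    Worker("prefill-0", [(1, 2, 3, 4)]),
    Worker("prefill-1", [(1, 2, 9)]),
]
print(route((1, 2, 3, 5), workers).name)  # -> prefill-0 (overlap 3 vs 2)
```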
Optimize LLM inference on NVIDIA HGX B200 GPUs with KV cache management, quantization, kernel tuning, and concurrency tuning.
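A sketch of how several of these knobs surface in vLLM's engine arguments; the specific values are illustrative starting points, not tuned recommendations:

```python
# Sketch: common vLLM optimization knobs; values are illustrative only.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed example model
    kv_cache_dtype="fp8",          # halve KV cache footprint vs fp16
    gpu_memory_utilization=0.92,   # leave headroom for activations
    max_num_seqs=256,              # upper bound on concurrent sequences
    enable_prefix_caching=True,    # reuse KV blocks across shared prefixes
)
```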
Benchmark methodology and performance results for LLM inference workloads on NVIDIA HGX B200 GPUs.
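The chapter documents the full methodology; as a minimal sketch of the kind of measurement involved (the endpoint URL, model name, and prompt are assumptions, and real runs add warmup, percentiles, and per-token accounting), one can time requests against vLLM's OpenAI-compatible server:

```python
# Minimal latency measurement against a vLLM OpenAI-compatible endpoint.
# URL, model name, and prompt are assumptions.
import json, time, urllib.request

URL = "http://localhost:8000/v1/completions"  # assumed endpoint

def timed_completion(prompt: str, max_tokens: int = 128) -> float:
    body = json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

latencies = [timed_completion("Explain KV caching.") for _ in range(5)]
print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")
```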
Common issues encountered when running vLLM on NVIDIA HGX B200 GPUs and their solutions.
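One common class of fix, sketched here with assumed values: engine startup failing with out-of-memory often yields to a smaller context cap and a lower memory-utilization target:

```python
# Sketch: a common mitigation for out-of-memory at engine startup is to
# cap the context length and reduce reserved GPU memory. Values assumed.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model
    max_model_len=8192,            # smaller max context -> smaller KV cache
    gpu_memory_utilization=0.85,   # reserve less, leaving room for spikes
    enforce_eager=True,            # skip CUDA graph capture to save memory
)
```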
Guidelines for deploying vLLM on NVIDIA HGX B200 instances in production.
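For instance, production clients should tolerate transient failures; a sketch of client-side retry with exponential backoff against the serving endpoint (the endpoint and model are assumptions, and the attempt counts and delays should be tuned to your SLOs):

```python
# Sketch: client-side retry with exponential backoff for production traffic.
# Endpoint and model are assumptions.
import json, time, urllib.error, urllib.request

def complete_with_retries(prompt: str, attempts: int = 4) -> str:
    body = json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model
        "prompt": prompt,
        "max_tokens": 64,
    }).encode()
    for i in range(attempts):
        try:
            req = urllib.request.Request(
                "http://localhost:8000/v1/completions", data=body,
                headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=60) as resp:
                return json.load(resp)["choices"][0]["text"]
        except (urllib.error.URLError, TimeoutError):
            if i == attempts - 1:
                raise
            time.sleep(2 ** i)  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```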
Get started running vLLM on AMD Instinct GPUs with hardware requirements, environment setup, and your first model deployment.
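The vLLM Python API is the same on ROCm as on CUDA; a sketch of a quick sanity check that the Instinct GPU is visible, followed by a first generation (the model name is an assumption):

```python
# Sketch: verify the Instinct GPU is visible, then run a first generation.
# On ROCm builds of PyTorch, Instinct GPUs surface through the torch.cuda API.
import torch
from vllm import LLM, SamplingParams

assert torch.cuda.is_available(), "no GPU visible - check the ROCm install"
print("device:", torch.cuda.get_device_name(0))

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model
out = llm.generate(["Hello from ROCm"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```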
Deploy large AI models including DeepSeek V3.2, Llama 3.1, Qwen3-VL, and Kimi-K2.5 on AMD Instinct GPUs.
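For the vision-language case, a sketch of image input through vLLM's multimodal interface; the model name and image-placeholder prompt template are assumptions, so consult the model chapter for the exact chat template:

```python
# Sketch: vision-language inference via vLLM's multi_modal_data input.
# Model name and prompt template are assumptions.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-8B-Instruct", trust_remote_code=True)  # assumed
image = Image.open("example.jpg")  # assumed local file

outputs = llm.generate(
    {
        "prompt": "<|vision_start|><|image_pad|><|vision_end|>Describe this image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```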
Improve LLM inference performance on AMD Instinct GPUs with FP8 quantization, KV cache optimization, and concurrency tuning.
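To make the FP8 KV cache benefit concrete, a back-of-envelope calculation; the model dimensions are Llama-3.1-70B's published ones (80 layers, 8 KV heads, head dim 128), while the memory budget and context length are assumptions:

```python
# Back-of-envelope KV cache sizing: why FP8 doubles achievable concurrency.
# Dimensions are Llama-3.1-70B's; the KV budget and context are assumptions.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
KV_BUDGET = 100 * 2**30   # bytes set aside for KV cache (assumed)
CONTEXT = 8192            # tokens per sequence (assumed)

def bytes_per_token(elem_bytes: int) -> int:
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * elem_bytes  # 2 = K and V

for name, elem in [("fp16", 2), ("fp8", 1)]:
    per_tok = bytes_per_token(elem)
    seqs = KV_BUDGET // (per_tok * CONTEXT)
    print(f"{name}: {per_tok/1024:.0f} KiB/token -> ~{seqs} concurrent 8K sequences")
# fp16: 320 KiB/token -> ~40 sequences; fp8: 160 KiB/token -> ~80
```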
Detailed benchmarking of DeepSeek, Llama, Qwen3-VL, and Kimi models on AMD Instinct MI325X GPUs with stress and validation testing.
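As a crude sketch of the stress-testing idea, fanning concurrent clients at the serving endpoint and checking the success rate (the endpoint, model, and load shape are assumptions; the chapter's stress and validation runs are far more systematic):

```python
# Sketch: a crude concurrency stress test against the serving endpoint.
# Endpoint, model, and load shape are assumptions.
import json, urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # assumed endpoint

def one_request(i: int) -> int:
    body = json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model
        "prompt": f"Request {i}: summarize the theory of relativity.",
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status

with ThreadPoolExecutor(max_workers=64) as pool:  # 64 concurrent clients
    statuses = list(pool.map(one_request, range(256)))
print("success rate:", statuses.count(200) / len(statuses))
```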
Common issues and verified solutions for vLLM on AMD Instinct GPUs.
Deploy vLLM with health checks, monitoring, and resilience on AMD Instinct GPUs.
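A sketch of a liveness probe loop against the `/health` route that vLLM's OpenAI-compatible server exposes; the URL and polling cadence are assumptions, and in practice this logic belongs in your orchestrator (Kubernetes probes, systemd watchdog) rather than an ad hoc script:

```python
# Sketch: a liveness probe loop against vLLM's /health endpoint.
# URL and polling cadence are assumptions.
import time, urllib.error, urllib.request

HEALTH_URL = "http://localhost:8000/health"  # vLLM OpenAI server health route

def is_healthy(timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

while True:
    print("healthy" if is_healthy() else "UNHEALTHY - consider restart/failover")
    time.sleep(15)  # assumed polling interval
```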