Configure your system for running vLLM on AMD Instinct GPUs.
The simplest approach is to use a pre-built Docker image.
# Latest stable release (recommended for most users)
docker pull vllm/vllm-openai-rocm:latest
# Or pin a specific ROCm + vLLM build tag for reproducible deployments
# (tag format: rocm<ROCM_VER>_vllm_<VLLM_VER>_<BUILD_DATE>)
docker pull rocm/vllm:rocm7.0.0_vllm_0.11.2_20251210
docker pull rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103
# Launch the vLLM OpenAI-compatible server on ROCm.
# NOTE: comments cannot be interleaved below — each backslash continuation
# must be followed immediately by a newline. Flag meanings:
#   --device /dev/kfd / /dev/dri      expose the GPU compute/render interfaces
#   --group-add=video                 GPU access permissions inside the container
#   --ipc=host                        share host shared memory (multi-GPU)
#   --security-opt seccomp=unconfined / --cap-add=SYS_PTRACE
#                                     required by ROCm tooling
#   -v ~/.cache/huggingface:...       reuse host model cache across runs
#   --env HF_TOKEN                    forward the HuggingFace token for gated models
# Replace MODEL_NAME with a HuggingFace model id (e.g. Qwen/Qwen3-0.6B).
docker run --rm \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm:latest \
--model MODEL_NAME
| Flag | Purpose |
|---|---|
| `--device /dev/kfd` | GPU compute interface |
| `--device /dev/dri` | GPU render interface |
| `--group-add=video` | GPU access permissions |
| `--ipc=host` | Shared memory for multi-GPU |
| `--security-opt seccomp=unconfined` | Required for ROCm |
| `-v ~/.cache/huggingface:/root/.cache/huggingface` | Cache model downloads |
# Enable AITER kernel optimizations (recommended)
export VLLM_ROCM_USE_AITER=1
# HuggingFace token for downloading gated models
export HF_TOKEN=your_token_here
# For DeepSeek V3.2 (prevents crashes)
export AITER_ENABLE_VSKIP=0
# For Vision-Language models (disables the Triton flash-attention path)
export VLLM_USE_TRITON_FLASH_ATTN=0
# Increase RCCL channels — note RCCL reads NCCL_-prefixed variables
export NCCL_MIN_NCHANNELS=112
# Schedule RCCL communication streams at high priority
export TORCH_NCCL_HIGH_PRIORITY=1
# Kernel argument optimization (HIP launch-latency tuning)
export HIP_FORCE_DEV_KERNARG=1
# Prefer hipBLASLt over rocBLAS for GEMM operations
export TORCH_BLAS_PREFER_HIPBLASLT=1
# Check current NUMA auto-balancing status (1 = enabled, 0 = disabled);
# disabling it is a common tuning step for GPU workloads — see vendor guides
cat /proc/sys/kernel/numa_balancing
# Disable if enabled (i.e. the command above returned 1)
sudo sysctl kernel.numa_balancing=0
# Make the setting persistent across reboots via sysctl.d
echo "kernel.numa_balancing=0" | sudo tee /etc/sysctl.d/99-numa.conf
# Add the current user to the video and render groups so the GPU device
# nodes (/dev/kfd, /dev/dri) are accessible without root.
# "$USER" is quoted so the command survives unusual usernames (SC2086).
sudo usermod -aG video,render "$USER"
# Log out and back in for the new group membership to take effect
# Verify ROCm sees all GPUs
rocm-smi --showproductname
# Check available VRAM per GPU
rocm-smi --showmeminfo vram
# Smoke-test the setup with a small model before deploying a large one.
# NOTE: no comments between the lines below — backslash continuations
# must be followed immediately by a newline.
docker run --rm \
--device /dev/kfd \
--device /dev/dri \
--group-add=video \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai-rocm:latest \
--model Qwen/Qwen3-0.6B
# Test the endpoint — should list the served model
curl http://localhost:8000/v1/models
For production deployments:
# docker-compose definition for a production vLLM server on ROCm.
# NOTE: nesting restored — the original snippet had lost its YAML indentation
# and was not parseable as written.
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai-rocm:latest
    ipc: host                      # shared host memory, needed for multi-GPU
    devices:
      - /dev/kfd                   # GPU compute interface
      - /dev/dri                   # GPU render interface
    group_add:
      - video                      # GPU access permissions
    security_opt:
      - seccomp:unconfined         # required for ROCm
    cap_add:
      - SYS_PTRACE                 # required by ROCm tooling
    ports:
      - "8000:8000"                # OpenAI-compatible HTTP API
    environment:
      - HF_TOKEN=${HF_TOKEN}       # forwarded from the host environment
      - VLLM_ROCM_USE_AITER=1      # enable AITER optimizations
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # cache model downloads
    command: >
      --model deepseek-ai/DeepSeek-V3.2
      --tensor-parallel-size 8
      --quantization fp8
# Check device permissions on the GPU device nodes
ls -la /dev/kfd /dev/dri
# Add the current user to the GPU groups ("$USER" quoted — SC2086)
sudo usermod -aG video,render "$USER"
# Verify NUMA balancing is disabled (should print 0)
cat /proc/sys/kernel/numa_balancing
# Check RCCL communication. RCCL reads NCCL_-prefixed environment
# variables (as with NCCL_MIN_NCHANNELS above), so the debug switch is
# NCCL_DEBUG — not RCCL_DEBUG.
NCCL_DEBUG=INFO python -c "import torch.distributed"
# If the server runs out of VRAM, reduce the fraction of GPU memory
# vLLM pre-allocates for weights + KV cache
--gpu-memory-utilization 0.85
# Or reduce weight memory with FP8 quantization
--quantization fp8