Deploy your first model on AMD Instinct GPUs.
Start with a small model to verify your setup:
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model Qwen/Qwen3-0.6B
```
Wait for the server to finish loading the model (look for "Application startup complete" in the container logs).
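Instead of watching the logs by hand, a short script can poll the server until it answers. This is a sketch using only the Python standard library; it assumes the server exposes a `/health` endpoint on port 8000, which vLLM's OpenAI-compatible server provides by default.

```python
# Poll the vLLM server's /health endpoint until it responds, so that
# downstream steps only run once the model has finished loading.
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url, timeout_s=600, interval_s=5):
    """Return True once GET {base_url}/health returns 200, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health") as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError):
            pass  # server not up yet; keep polling
        time.sleep(interval_s)
    return False
```

Call `wait_until_ready("http://localhost:8000")` before sending the first request; large models can take several minutes to load, so size `timeout_s` accordingly.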
Verify the server is up and the model is loaded:

```bash
curl http://localhost:8000/v1/models
```

Send a chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }'
```

Or use the text completions endpoint:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "The capital of France is",
    "max_tokens": 20
  }'
```
For large models, use tensor parallelism across all 8 GPUs:
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768
```
| Flag | Purpose | Example |
|---|---|---|
| `--tensor-parallel-size` | Shard the model across GPUs | `--tensor-parallel-size 8` |
| `--quantization` | Weight quantization format | `--quantization fp8` |
| `--max-model-len` | Maximum context length in tokens | `--max-model-len 32768` |
| `--gpu-memory-utilization` | Fraction of each GPU's VRAM vLLM may use | `--gpu-memory-utilization 0.90` |
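A quick back-of-the-envelope check shows why these flags work together for a 405B model. The sketch below assumes 8 GPUs with 192 GB of HBM each (MI300X-class; adjust `vram_gb` for your hardware) and counts only model weights, ignoring KV cache and activations, which consume the remaining budget.

```python
# Rough feasibility check: do the quantized weights fit in aggregate VRAM?
def weights_fit(params_billions, bytes_per_param, num_gpus, vram_gb, utilization):
    """Return (weight_gb, budget_gb, fits) for a tensor-parallel deployment."""
    weight_gb = params_billions * bytes_per_param  # 1e9 params * N bytes ~= N GB per billion
    budget_gb = num_gpus * vram_gb * utilization
    return weight_gb, budget_gb, weight_gb < budget_gb


# Llama 3.1 405B at fp8 (1 byte per parameter):
weights, budget, fits = weights_fit(405, 1, num_gpus=8, vram_gb=192, utilization=0.90)
# ~405 GB of weights against a ~1382 GB budget; the headroom holds the
# KV cache, whose size scales with --max-model-len and concurrency.
```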
For production deployments, add these flags:
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  --env "VLLM_ROCM_USE_AITER=1" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model MODEL_NAME \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0
```
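Before putting the deployment behind real traffic, a concurrency smoke test confirms the server handles parallel requests. This is a standard-library sketch; the endpoint and payload shape mirror the completions curl example above, and the model name and prompts are placeholders.

```python
# Fire several completion requests in parallel against the running server.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def make_payload(model, prompt, max_tokens=20):
    """Build a /v1/completions payload."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def complete(base_url, payload):
    """POST one completion request and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


def smoke_test(base_url, model, prompts, workers=8):
    """Send all prompts concurrently; vLLM batches them server-side."""
    payloads = [make_payload(model, p) for p in prompts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pl: complete(base_url, pl), payloads))


# With the server running:
#   outputs = smoke_test("http://localhost:8000", "MODEL_NAME",
#                        ["The capital of France is"] * 16)
```

vLLM batches concurrent requests internally (continuous batching), so aggregate throughput should rise well above the single-request rate.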