Deploy Meta's Llama-3.1-405B-Instruct on AMD Instinct GPUs.
| Property | Value |
|---|---|
| Model ID | meta-llama/Llama-3.1-405B-Instruct |
| Architecture | Dense Transformer with GQA |
| Total Parameters | 405B |
| Context Length | 128,000 tokens |
| License | Llama 3.1 Community License |
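Llama 3.1 weights are gated on Hugging Face: accept the license on the model page first, then export an access token so the container below can download the weights. The variable name matches the `HF_TOKEN` used in the docker command; the token value shown is a placeholder.

```bash
# Export a Hugging Face access token for the gated model download.
# Replace the placeholder with a real token from your Hugging Face account settings.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```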
```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```
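Once the container logs show the server is up, a quick sanity check is to list the served models via the OpenAI-compatible endpoint (assuming the default port mapping above):

```bash
# Confirm the model is loaded and being served on port 8000.
curl http://localhost:8000/v1/models
```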
| Metric | Value |
|---|---|
| Model Memory (FP8) | ~210 GB (across 8 GPUs) |
| Per GPU | ~26 GB |
| Load Time | ~60 seconds |
| Concurrency | Throughput | p99 Latency |
|---|---|---|
| 10 | 1,090 tok/s | 16.17s |
| 50 | 4,381 tok/s | 20.14s |
| 100 | 6,802 tok/s | 25.84s |
| 200 | 6,674 tok/s | 26.33s |
| 500 | 6,804 tok/s | 25.84s |
Multi-run means (n=5).
See Llama-3.1-405B Stress Testing for detailed results.
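A rough client-side concurrency sweep can be run with `xargs` to issue parallel requests; this sketch is illustrative only (the payload, prompt, and request counts are assumptions, not the harness used for the table above):

```bash
# Fire 50 identical completion requests with up to 50 in flight,
# printing per-request wall time as a rough concurrency probe.
seq 50 | xargs -P 50 -I{} curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-405B-Instruct", "prompt": "Hello", "max_tokens": 64}' \
  -o /dev/null -w "%{time_total}s\n"
```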
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 500
  }'
```
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "prompt": "The key benefits of renewable energy are:",
    "max_tokens": 200
  }'
```
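For interactive use, the chat endpoint also supports streaming: set `"stream": true` and tokens arrive incrementally as server-sent events (standard OpenAI-compatible behavior in vLLM; `-N` disables curl's output buffering):

```bash
# Stream a chat completion token by token instead of waiting for the full response.
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 100,
    "stream": true
  }'
```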
For 128K context (requires more memory):
```bash
--max-model-len 131072
```
For tighter memory constraints:
```bash
--max-model-len 16384 \
--gpu-memory-utilization 0.85
```
| Use Case | Concurrency | Expected Throughput | p99 Latency |
|---|---|---|---|
| Low latency | 10 | ~1,090 tok/s | ~16s |
| Balanced | 50 | ~4,381 tok/s | ~20s |
| High throughput | 100-200 | ~6,700-6,800 tok/s | ~26s |
| Maximum throughput | 750 | ~6,808 tok/s | ~26s |
Reduce the context length:

```bash
--max-model-len 16384
```
Enable FP8 quantization for better performance:

```bash
--quantization fp8 \
--kv-cache-dtype fp8
```