Llama 3.1 (405B)

Updated on 17 March 2026

Deploy Meta's Llama-3.1-405B-Instruct on AMD Instinct GPUs.


Model Overview

| Property | Value |
| --- | --- |
| Model ID | meta-llama/Llama-3.1-405B-Instruct |
| Architecture | Dense Transformer with GQA |
| Total Parameters | 405B |
| Context Length | 128,000 tokens |
| License | Llama 3.1 Community License |

Quick Start

```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```
> **Note:** Llama-3.1-405B requires accepting Meta's license agreement on Hugging Face. Visit the model page to request access.
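Loading 405B of weights takes a while, so it helps to wait for the server before sending traffic. The helper below is a sketch that polls vLLM's `/health` endpoint (the port matches the Quick Start command above) once per second until it responds:

```bash
# Poll the server's /health endpoint until it returns success, or give up.
# Usage: wait_for_server <url> [max_attempts]  -- one attempt per second.
wait_for_server() {
  url=$1
  attempts=${2:-120}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf "$url" > /dev/null; then
      return 0    # server answered: ready to serve requests
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1        # never became healthy within the time budget
}
```

For example, `wait_for_server http://localhost:8000/health && echo ready` blocks until the model has finished loading.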

Memory Usage

| Metric | Value |
| --- | --- |
| Model Memory (FP8) | ~210 GB (across 8 GPUs) |
| Per GPU | ~26 GB |
| Load Time | ~60 seconds |

Performance (MI325X Verified)

Concurrency Scaling

| Concurrency | Throughput | p99 Latency |
| --- | --- | --- |
| 10 | 1,090 tok/s | 16.17s |
| 50 | 4,381 tok/s | 20.14s |
| 100 | 6,802 tok/s | 25.84s |
| 200 | 6,674 tok/s | 26.33s |
| 500 | 6,804 tok/s | 25.84s |

Multi-run means (n=5).

Peak Performance

  • Peak throughput: 6,808 tok/s at 750 concurrent requests
  • Saturation point: ~100 concurrent requests
  • Success rate: 100% at all tested concurrency levels

See Llama-3.1-405B Stress Testing for detailed results.
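A rough local approximation of a concurrency sweep can be scripted with `xargs -P`, which caps the number of jobs in flight. This is only a sketch, not the harness behind the published numbers; the placeholder `echo` stands in for a real request command such as the curl calls in the Test Endpoints section:

```bash
# Launch 100 requests with at most 100 running in parallel.
# Substitute the echo with an actual curl call against the server.
seq 1 100 | xargs -P 100 -I{} sh -c 'echo "request {} completed"'
```

Timing the pipeline at several `-P` values gives a crude view of where throughput stops scaling.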

Test Endpoints

Chat Completion

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 500
  }'
```
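The raw JSON can be trimmed down to just the assistant's reply with a short pipe. The field names follow the OpenAI-compatible response schema that vLLM serves; this sketch assumes `python3` is available on the client:

```bash
# Print only the assistant's message from a chat completion response.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
```

Append `| extract_reply` to the curl command above to see only the generated text.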

Text Completion

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "prompt": "The key benefits of renewable energy are:",
    "max_tokens": 200
  }'
```

Configuration Variants

Maximum Context Length

For 128K context (requires more memory):

```bash
--max-model-len 131072
```
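To see why full 128K context needs extra headroom, a back-of-the-envelope KV-cache estimate helps. The figures below assume the published Llama 3.1 405B architecture (126 layers, 8 KV heads under GQA, head dimension 128) and 1 byte per element for the fp8 KV cache; treat the result as a rough sizing aid, not a measured number:

```bash
# Rough KV-cache footprint of one 131,072-token sequence:
# 2 (K and V) x layers x kv_heads x head_dim x bytes/element, per token.
awk 'BEGIN {
  per_token = 2 * 126 * 8 * 128 * 1        # bytes of KV cache per token
  total = per_token * 131072               # bytes for the full context
  printf "%.1f GB total, %.1f GB per GPU (TP=8)\n", total / 1e9, total / 1e9 / 8
}'
```

So each full-length sequence consumes on the order of a few GB per GPU on top of the weights, which is why shorter `--max-model-len` settings leave more room for batching.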

Lower Memory Usage

For tighter memory constraints:

```bash
--max-model-len 16384 \
--gpu-memory-utilization 0.85
```
Choosing a Concurrency Level

| Use Case | Concurrency | Expected Throughput | p99 Latency |
| --- | --- | --- | --- |
| Low latency | 10 | ~1,090 tok/s | ~16s |
| Balanced | 50 | ~4,381 tok/s | ~20s |
| High throughput | 100-200 | ~6,700-6,800 tok/s | ~26s |
| Maximum throughput | 750 | ~6,808 tok/s | ~26s |

Troubleshooting

OOM During Loading

Reduce context length:

```bash
--max-model-len 16384
```

Slow Generation

If you are running without quantization, enable FP8 for the weights and KV cache (already included in the Quick Start command) for better performance:

```bash
--quantization fp8 \
--kv-cache-dtype fp8
```
