Llama 3.1 (405B)

Updated on 17 March 2026

Deploy Meta's Llama-3.1-405B-Instruct on AMD Instinct GPUs.


Model Overview

| Property | Value |
| --- | --- |
| Model ID | meta-llama/Llama-3.1-405B-Instruct |
| Architecture | Dense Transformer with GQA |
| Total Parameters | 405B |
| Context Length | 128,000 tokens |
| License | Llama 3.1 Community License |

Quick Start

```bash
docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:latest \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```
> **Note:** Llama-3.1-405B requires accepting Meta's license agreement on Hugging Face. Visit the model page to request access.
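Loading 405B of weights takes a while, so it helps to wait for the server before sending traffic. The helper below is a sketch that polls vLLM's `/health` endpoint (the port matches the Quick Start command above) once per second until it responds:

```bash
# Poll the server's /health endpoint until it returns success, or give up.
# Usage: wait_for_server <url> [max_attempts]  -- one attempt per second.
wait_for_server() {
  url=$1
  attempts=${2:-120}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf "$url" > /dev/null; then
      return 0    # server answered: ready to serve requests
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1        # never became healthy within the time budget
}
```

For example, `wait_for_server http://localhost:8000/health && echo ready` blocks until the model has finished loading.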

Memory Usage

| Metric | Value |
| --- | --- |
| Model Memory (FP8) | ~210 GB (across 8 GPUs) |
| Per GPU | ~26 GB |
| Load Time | ~60 seconds |

Performance (MI325X Verified)

Concurrency Scaling

| Concurrency | Throughput | p99 Latency |
| --- | --- | --- |
| 10 | 1,090 tok/s | 16.17s |
| 50 | 4,381 tok/s | 20.14s |
| 100 | 6,802 tok/s | 25.84s |
| 200 | 6,674 tok/s | 26.33s |
| 500 | 6,804 tok/s | 25.84s |

Multi-run means (n=5).

Peak Performance

  • Peak throughput: 6,808 tok/s at 750 concurrent requests
  • Saturation point: ~100 concurrent requests
  • Success rate: 100% at all tested concurrency levels

See Llama-3.1-405B Stress Testing for detailed results.
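A rough local approximation of a concurrency sweep can be scripted with `xargs -P`, which caps the number of jobs in flight. This is only a sketch, not the harness behind the published numbers; the placeholder `echo` stands in for a real request command such as the curl calls in the Test Endpoints section:

```bash
# Launch 100 requests with at most 100 running in parallel.
# Substitute the echo with an actual curl call against the server.
seq 1 100 | xargs -P 100 -I{} sh -c 'echo "request {} completed"'
```

Timing the pipeline at several `-P` values gives a crude view of where throughput stops scaling.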

Test Endpoints

Chat Completion

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 500
  }'
```
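The raw JSON can be trimmed down to just the assistant's reply with a short pipe. The field names follow the OpenAI-compatible response schema that vLLM serves; this sketch assumes `python3` is available on the client:

```bash
# Print only the assistant's message from a chat completion response.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
```

Append `| extract_reply` to the curl command above to see only the generated text.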

Text Completion

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-405B-Instruct",
    "prompt": "The key benefits of renewable energy are:",
    "max_tokens": 200
  }'
```

Configuration Variants

Maximum Context Length

For 128K context (requires more memory):

```bash
--max-model-len 131072
```
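To see why full 128K context needs extra headroom, a back-of-the-envelope KV-cache estimate helps. The figures below assume the published Llama 3.1 405B architecture (126 layers, 8 KV heads under GQA, head dimension 128) and 1 byte per element for the fp8 KV cache; treat the result as a rough sizing aid, not a measured number:

```bash
# Rough KV-cache footprint of one 131,072-token sequence:
# 2 (K and V) x layers x kv_heads x head_dim x bytes/element, per token.
awk 'BEGIN {
  per_token = 2 * 126 * 8 * 128 * 1        # bytes of KV cache per token
  total = per_token * 131072               # bytes for the full context
  printf "%.1f GB total, %.1f GB per GPU (TP=8)\n", total / 1e9, total / 1e9 / 8
}'
```

So each full-length sequence consumes on the order of a few GB per GPU on top of the weights, which is why shorter `--max-model-len` settings leave more room for batching.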

Lower Memory Usage

For tighter memory constraints:

```bash
--max-model-len 16384 \
--gpu-memory-utilization 0.85
```
Choosing a Concurrency Level

| Use Case | Concurrency | Expected Throughput | p99 Latency |
| --- | --- | --- | --- |
| Low latency | 10 | ~1,090 tok/s | ~16s |
| Balanced | 50 | ~4,381 tok/s | ~20s |
| High throughput | 100-200 | ~6,700-6,800 tok/s | ~26s |
| Maximum throughput | 750 | ~6,808 tok/s | ~26s |

Troubleshooting

OOM During Loading

Reduce context length:

```bash
--max-model-len 16384
```

Slow Generation

If you are running without quantization, enable FP8 for the weights and KV cache (already included in the Quick Start command) for better performance:

```bash
--quantization fp8 \
--kv-cache-dtype fp8
```
