Llama 4 Scout 17B 16E Instruct icon

Llama 4 Scout 17B 16E Instruct

NVIDIA
Llama-4-Scout-17B-16E-Instruct is an instruction-tuned, natively multimodal auto-regressive model developed by Meta, built on a mixture-of-experts architecture with 17B activated parameters (109B total). Designed for assistant-style and vision-language workloads, it supports multilingual text and image inputs and produces multilingual text and code outputs. The model features a native 10 million token context window, enabling extremely long-context analysis across documents, conversations, and multimodal inputs. It was trained on a mixture of publicly available and licensed data, including data from Meta’s products, with a knowledge cutoff of August 2024, and officially released on April 5, 2025.
TypeVision-Language Model
CapabilitiesText Generation, Instruction Following, Reasoning, Mathematical Reasoning+6 more
Group Release DateApril 4, 2025
Links
LicenseLlama4

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE
docker run -it --rm 
 --runtime=nvidia 
 --gpus all 
 --ipc=host 
 --shm-size=128g 
 -p 8000:8000 
 -v ~/.cache/huggingface:/root/.cache/huggingface 
 -e HF_TOKEN='YOUR_HF_TOKEN' 
 -e LD_LIBRARY_PATH='/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu' 
 vllm/vllm-openai:v0.15.0-cu130 
 meta-llama/Llama-4-Scout-17B-16E-Instruct 
  --tensor-parallel-size 8 
 --max-model-len auto 
  --max-num-batched-tokens 65536 
  --gpu-memory-utilization 0.95 
  --max-num-seqs 1024 
 --disable-log-requests 
 --trust-remote-code

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.

ITL vs Concurrency

Time to First Token

Throughput Scaling

Total Tokens/sec vs Avg TTFT

Vultr Cloud GPU

NVIDIA HGX B200

Deploy NVIDIA B200 on Vultr Cloud GPU

How to Deploy Llama 4 Scout 17B 16E Instruct on NVIDIA GPUs | Vultr Docs