This guide explains the methodology used for all benchmark results in this documentation, and provides the scripts to reproduce them.
Our unified benchmark follows a progressive approach to stress testing:
WARMUP → BASELINE → SCALING → STRESS → SATURATION

| Phase | Purpose | What It Measures |
|---|---|---|
| WARMUP | Initialize model, warm caches | Single request baseline |
| BASELINE | Establish clean metrics | p99 latency at concurrency=1 |
| SCALING | Find optimal concurrency | Throughput vs latency trade-off |
| STRESS | Test edge cases | Long context, long output, multi-image |
| SATURATION | Find breaking point | Maximum sustainable load |
Each test level is assigned a status based on these thresholds:

| Status | Meaning | Threshold |
|---|---|---|
| OK | Normal operation | Success rate ≥95%, latency acceptable |
| DEGRADED | Latency increased | p99 > 2x baseline p99 |
| SATURATED | Throughput plateau | Throughput < 1.05x previous level |
| FAILING | Requests failing | Success rate < 95% |
Note: DEGRADED status is expected under concurrent load; it reflects the latency cost of higher throughput, not a problem.
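For reference, the thresholds above can be written as a small classifier; this mirrors the status logic used in the benchmark script below:

```python
def classify_status(success_rate, p99, baseline_p99=None,
                    throughput=None, prev_throughput=None):
    """Map raw run metrics to a status using the documented thresholds."""
    if success_rate < 0.95:
        return "FAILING"       # success rate below 95%
    if baseline_p99 and p99 > baseline_p99 * 2:
        return "DEGRADED"      # p99 more than 2x the clean baseline
    if prev_throughput and throughput and throughput < prev_throughput * 1.05:
        return "SATURATED"     # less than 5% throughput gain over previous level
    return "OK"
```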
Three modes scale the request volume via a multiplier:

| Mode | Multiplier | Use Case |
|---|---|---|
| --quick | 0.5x | Fast validation, CI/CD |
| (default) | 1x | Standard benchmarking |
| --thorough | 3x | Production capacity planning |
The unified benchmark script runs all phases automatically:
#!/usr/bin/env python3
"""
Unified Progressive Benchmark for vLLM

Usage:
    # Quick validation test
    python unified_benchmark.py --model deepseek-ai/DeepSeek-V3.2 --quick

    # Standard benchmark
    python unified_benchmark.py --model meta-llama/Llama-3.1-405B-Instruct

    # Thorough stress test
    python unified_benchmark.py --model Qwen/Qwen3-VL-235B-A22B-Instruct --vision --thorough

Options:
    --model            Model name (required)
    --vision           Enable vision tests for VLMs
    --quick            0.5x requests (fast validation)
    --thorough         3x requests (comprehensive stress test)
    --skip-saturation  Skip extreme load tests
"""
import argparse
import json
import requests
import time
import concurrent.futures
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass, asdict
RESULTS_DIR = Path("./results")
TEST_IMAGES = [
    "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=1280",
    "https://images.unsplash.com/photo-1506905925346-21bda4d32df4?w=1280",
    "https://images.unsplash.com/photo-1504674900247-0877df9cc836?w=1280",
]
@dataclass
class TestResult:
    phase: str
    test_name: str
    test_type: str
    concurrency: int
    num_requests: int
    successful: int
    failed: int
    success_rate: float
    duration_sec: float
    latency_p50: float
    latency_p95: float
    latency_p99: float
    throughput_total: float
    throughput_output: float
    status: str  # OK, DEGRADED, SATURATED, FAILING
def make_request(base_url, model, prompt, max_tokens, image_urls=None):
    """Make a single API request and return success, latency, and token counts."""
    start = time.time()
    try:
        if image_urls:
            content = [{"type": "text", "text": prompt}]
            for url in image_urls:
                content.append({"type": "image_url", "image_url": {"url": url}})
            messages = [{"role": "user", "content": content}]
        else:
            messages = [{"role": "user", "content": prompt}]
        resp = requests.post(
            f"{base_url}/v1/chat/completions",
            json={"model": model, "messages": messages, "max_tokens": max_tokens},
            timeout=300,
        )
        elapsed = time.time() - start
        if resp.status_code == 200:
            data = resp.json()
            return {
                "success": True,
                "latency": elapsed,
                "prompt_tokens": data.get("usage", {}).get("prompt_tokens", 0),
                "completion_tokens": data.get("usage", {}).get("completion_tokens", 0),
            }
        return {"success": False, "latency": elapsed}
    except Exception:
        return {"success": False, "latency": time.time() - start}
def run_test(base_url, model, phase, test_name, concurrency, num_requests,
             input_tokens, output_tokens, images_per_request=0,
             baseline_p99=None, prev_throughput=None):
    """Run one test level and classify the result."""
    # Generate a prompt of roughly input_tokens tokens (4-token filler chunks)
    prompt = "Explain in detail: " * (input_tokens // 4)
    # Run concurrent requests, measuring wall-clock duration for the whole batch
    test_start = time.time()
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = []
        for _ in range(num_requests):
            img_urls = TEST_IMAGES[:images_per_request] if images_per_request else None
            futures.append(executor.submit(
                make_request, base_url, model, prompt, output_tokens, img_urls
            ))
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
    duration = max(time.time() - test_start, 1e-6)
    # Calculate metrics
    successful = [r for r in results if r["success"]]
    latencies = sorted(r["latency"] for r in successful)

    def percentile(data, p):
        """Percentile with linear interpolation over sorted data."""
        k = (len(data) - 1) * p / 100
        f = int(k)
        if f == k:
            return data[f]
        return data[f] + (k - f) * (data[min(f + 1, len(data) - 1)] - data[f])

    total_tokens = sum(r.get("prompt_tokens", 0) + r.get("completion_tokens", 0)
                       for r in successful)
    throughput = total_tokens / duration
    p99 = percentile(latencies, 99) if latencies else 0
    success_rate = len(successful) / num_requests
    # Classify status against the documented thresholds
    if success_rate < 0.95:
        status = "FAILING"
    elif baseline_p99 and p99 > baseline_p99 * 2:
        status = "DEGRADED"
    elif prev_throughput and throughput < prev_throughput * 1.05:
        status = "SATURATED"
    else:
        status = "OK"
    return TestResult(
        phase=phase, test_name=test_name,
        test_type="vision" if images_per_request else "text",
        concurrency=concurrency, num_requests=num_requests,
        successful=len(successful), failed=len(results) - len(successful),
        success_rate=success_rate, duration_sec=duration,
        latency_p50=percentile(latencies, 50) if latencies else 0,
        latency_p95=percentile(latencies, 95) if latencies else 0,
        latency_p99=p99,
        throughput_total=throughput,
        throughput_output=sum(r.get("completion_tokens", 0) for r in successful) / duration,
        status=status,
    )
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base-url", default="http://localhost:8000")
    parser.add_argument("--model", required=True)
    parser.add_argument("--vision", action="store_true")
    parser.add_argument("--quick", action="store_true")
    parser.add_argument("--thorough", action="store_true")
    parser.add_argument("--skip-saturation", action="store_true")
    args = parser.parse_args()

    multiplier = 0.5 if args.quick else 3 if args.thorough else 1
    all_results = []

    # Phase 1: Warmup
    print("Phase 1: WARMUP")
    r = run_test(args.base_url, args.model, "WARMUP", "single", 1, 1, 100, 50)
    all_results.append(r)

    # Phase 2: Baseline
    print("Phase 2: BASELINE")
    r = run_test(args.base_url, args.model, "BASELINE", "baseline", 1,
                 int(10 * multiplier), 500, 100)
    all_results.append(r)
    baseline_p99 = r.latency_p99

    # Phase 3: Scaling
    print("Phase 3: SCALING")
    prev_throughput = None
    for conc in [5, 10, 25, 50, 75, 100]:
        r = run_test(args.base_url, args.model, "SCALING", f"scale_{conc}",
                     conc, int(conc * 2 * multiplier), 500, 200,
                     baseline_p99=baseline_p99, prev_throughput=prev_throughput)
        all_results.append(r)
        prev_throughput = r.throughput_total
        if r.status in ["SATURATED", "FAILING"]:
            break

    # Phase 4: Stress tests
    print("Phase 4: STRESS")
    r = run_test(args.base_url, args.model, "STRESS", "long_output", 50,
                 int(30 * multiplier), 200, 500)
    all_results.append(r)
    r = run_test(args.base_url, args.model, "STRESS", "long_context", 25,
                 int(15 * multiplier), 4000, 200)
    all_results.append(r)
    if args.vision:
        # Multi-image stress test for vision-language models
        r = run_test(args.base_url, args.model, "STRESS", "multi_image", 25,
                     int(15 * multiplier), 200, 200, images_per_request=3)
        all_results.append(r)

    # Phase 5: Saturation
    if not args.skip_saturation:
        print("Phase 5: SATURATION")
        prev_throughput = None
        for conc in [150, 200, 300, 500]:
            r = run_test(args.base_url, args.model, "SATURATION", f"extreme_{conc}",
                         conc, int(conc * multiplier), 500, 100,
                         prev_throughput=prev_throughput)
            all_results.append(r)
            prev_throughput = r.throughput_total

    # Save results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    RESULTS_DIR.mkdir(exist_ok=True)
    with open(RESULTS_DIR / f"benchmark_{timestamp}.json", "w") as f:
        json.dump({"results": [asdict(r) for r in all_results]}, f, indent=2)
    print(f"\nResults saved to results/benchmark_{timestamp}.json")

if __name__ == "__main__":
    main()
#!/bin/bash
docker run --rm \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
--env "VLLM_ROCM_USE_AITER=1" \
--env "AITER_ENABLE_VSKIP=0" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm:latest \
--model deepseek-ai/DeepSeek-V3.2 \
--tensor-parallel-size 8 \
--block-size 1 \
--quantization fp8

#!/bin/bash
docker run --rm \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
--env "VLLM_USE_TRITON_FLASH_ATTN=0" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm:latest \
--model meta-llama/Llama-3.1-405B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--quantization fp8

#!/bin/bash
docker run --rm \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
--env "VLLM_USE_TRITON_FLASH_ATTN=0" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm:latest \
--model Qwen/Qwen3-VL-235B-A22B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--kv-offloading-backend native \
--kv-offloading-size 64 \
--disable-hybrid-kv-cache-manager

# 1. Start vLLM server
./start_deepseek_v32.sh
# 2. Wait for model to load (check logs for "Application startup complete")
# 3. Run validation test (quick)
python unified_benchmark.py --model deepseek-ai/DeepSeek-V3.2 --quick
# 4. Run full stress test (thorough)
python unified_benchmark.py --model deepseek-ai/DeepSeek-V3.2 --thorough
# For vision-language models like Qwen3-VL
python unified_benchmark.py \
--model Qwen/Qwen3-VL-235B-A22B-Instruct \
--vision \
--thorough
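Step 2 above ("wait for model to load") can be automated instead of watching logs. A minimal polling sketch, assuming the server's /health endpoint (exposed by vLLM's OpenAI-compatible server once startup completes):

```python
import time

def wait_for_server(probe, timeout=600, interval=5):
    """Poll `probe` until it returns True, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False

def vllm_ready(base_url="http://localhost:8000"):
    import requests  # third-party; same dependency as the benchmark script
    # vLLM's OpenAI-compatible server answers /health once the model is loaded
    return requests.get(f"{base_url}/health", timeout=5).status_code == 200

# Example: block until the server is up, then kick off the benchmark
# wait_for_server(vllm_ready)
```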
Each benchmark run produces:
Each benchmark run produces a timestamped JSON file:

- results/benchmark_YYYYMMDD_HHMMSS.json - raw per-test data for analysis

| Metric | Description |
|---|---|
| throughput_total | Total tokens (input + output) per second |
| throughput_output | Output tokens per second (generation speed) |
| latency_p99 | 99th percentile request latency |
| success_rate | Percentage of successful requests |
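The raw JSON can be condensed into a one-line-per-test summary with a few lines of Python (a sketch assuming the `{"results": [...]}` schema the benchmark script writes):

```python
import json

def summarize(report):
    """Render one summary line per test from a benchmark results dict."""
    lines = []
    for r in report["results"]:
        lines.append(
            f"{r['phase']:<10} {r['test_name']:<12} c={r['concurrency']:<4} "
            f"p99={r['latency_p99']:.2f}s out_tps={r['throughput_output']:.1f} "
            f"[{r['status']}]"
        )
    return lines

# Example (hypothetical results file):
# with open("results/benchmark_20250101_120000.json") as f:
#     print("\n".join(summarize(json.load(f))))
```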
During high-throughput LLM inference, you may observe low GPU compute utilization (5-10%) in rocm-smi despite excellent throughput. This is expected: autoregressive decoding is memory-bandwidth-bound, so each step is dominated by streaming weights and KV cache from HBM while the compute units sit largely idle.
Monitor rocm-smi --showmeminfo vram and memory bandwidth instead of compute utilization.
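The bandwidth bound is easy to estimate. A back-of-envelope sketch, using assumed figures (6 TB/s peak HBM3E bandwidth per MI325X from the spec sheet, the 405B dense model in FP8 across 8 GPUs; MoE models read only active experts, so their ceiling is higher):

```python
# Single-sequence decode speed is bounded by how fast the weights can be
# streamed from HBM, not by compute. Assumed figures -- adjust for your setup:
HBM_BW_PER_GPU_TBS = 6.0   # MI325X peak HBM3E bandwidth (TB/s)
NUM_GPUS = 8               # tensor-parallel degree
PARAMS_BILLION = 405       # Llama-3.1-405B, dense
BYTES_PER_PARAM = 1        # FP8 quantization

model_bytes = PARAMS_BILLION * 1e9 * BYTES_PER_PARAM
aggregate_bw = HBM_BW_PER_GPU_TBS * 1e12 * NUM_GPUS

# Each decode step reads (roughly) every weight once, so the ceiling is:
max_decode_tps = aggregate_bw / model_bytes
print(f"~{max_decode_tps:.0f} tokens/s ceiling at batch size 1")
```

Batching amortizes those weight reads across sequences, which is why throughput climbs with concurrency while per-request latency rises, and why compute utilization stays low throughout.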
Our testing revealed distinct performance profiles for different architectures:
| Architecture | Example | Strengths | Weaknesses |
|---|---|---|---|
| MoE + GQA | Qwen3-VL | Highest throughput, best scaling | Requires KV offloading workaround |
| Dense + GQA | Llama-405B | Most linear scaling, predictable | Lower absolute throughput |
| MoE + MLA | DeepSeek V3.2 | Memory efficient, good single-request latency | Scaling plateaus earlier, no KV offloading |
When to choose each:

- MoE + GQA (e.g. Qwen3-VL): maximum aggregate throughput, if you can accommodate the KV offloading workaround.
- Dense + GQA (e.g. Llama-405B): predictable, near-linear scaling for capacity planning.
- MoE + MLA (e.g. DeepSeek V3.2): memory-constrained deployments and latency-sensitive, low-concurrency workloads.
All benchmarks in this documentation were run on:
| Specification | Value |
|---|---|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | 0.14.1 |