Benchmarking Guide

Updated on 11 March, 2026

This guide explains the methodology behind all benchmark results in this documentation and provides the scripts to reproduce them.


Methodology

Our unified benchmark follows a progressive approach to stress testing:

WARMUP → BASELINE → SCALING → STRESS → SATURATION

Phase Overview

| Phase | Purpose | What It Measures |
|-------|---------|------------------|
| WARMUP | Initialize model, warm caches | Single-request baseline |
| BASELINE | Establish clean metrics | p99 latency at concurrency=1 |
| SCALING | Find optimal concurrency | Throughput vs. latency trade-off |
| STRESS | Test edge cases | Long context, long output, multi-image |
| SATURATION | Find breaking point | Maximum sustainable load |

Status Indicators

| Status | Meaning | Threshold |
|--------|---------|-----------|
| OK | Normal operation | Success rate ≥ 95%, latency acceptable |
| DEGRADED | Latency increased | p99 > 2x baseline p99 |
| SATURATED | Throughput plateau | Throughput < 1.05x previous level |
| FAILING | Requests failing | Success rate < 95% |

Note: DEGRADED status is expected under concurrent load; it reflects the latency cost of higher throughput, not a problem.
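The thresholds above can be expressed as a small standalone classifier. This is an illustrative sketch (the `classify_status` name is ours, not part of the benchmark script), checked in the same order of precedence:

```python
def classify_status(success_rate, p99, baseline_p99=None, throughput=None,
                    prev_throughput=None):
    """Map raw metrics to a status label using the thresholds above."""
    if success_rate < 0.95:
        return "FAILING"      # success rate below 95%
    if baseline_p99 and p99 > baseline_p99 * 2:
        return "DEGRADED"     # p99 more than 2x the baseline p99
    if prev_throughput and throughput is not None and throughput < prev_throughput * 1.05:
        return "SATURATED"    # throughput gained less than 5% over the previous level
    return "OK"

print(classify_status(0.99, 1.2, baseline_p99=1.0))  # → "OK" (latency within 2x baseline)
```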

Test Modes

| Mode | Multiplier | Use Case |
|------|------------|----------|
| --quick | 0.5x | Fast validation, CI/CD |
| (default) | 1x | Standard benchmarking |
| --thorough | 3x | Production capacity planning |
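The multiplier simply scales each phase's base request count. A minimal sketch (the `request_count` helper is illustrative, not part of the script):

```python
def request_count(base, quick=False, thorough=False):
    """Scale a phase's base request count by the test-mode multiplier."""
    multiplier = 0.5 if quick else 3 if thorough else 1  # --quick / --thorough
    return int(base * multiplier)

print(request_count(10, quick=True))  # → 5 requests in --quick mode
```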

Benchmark Script

The unified benchmark script runs all phases automatically:

python
#!/usr/bin/env python3
"""
Unified Progressive Benchmark for vLLM

Usage:
  # Quick validation test
  python unified_benchmark.py --model deepseek-ai/DeepSeek-V3.2 --quick

  # Standard benchmark
  python unified_benchmark.py --model meta-llama/Llama-3.1-405B-Instruct

  # Thorough stress test
  python unified_benchmark.py --model Qwen/Qwen3-VL-235B-A22B-Instruct --vision --thorough

Options:
  --model         Model name (required)
  --vision        Enable vision tests for VLMs
  --quick         0.5x requests (fast validation)
  --thorough      3x requests (comprehensive stress test)
  --skip-saturation   Skip extreme load tests
"""

import argparse
import json
import requests
import time
import concurrent.futures
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass, asdict

RESULTS_DIR = Path("./results")

TEST_IMAGES = [
    "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=1280",
    "https://images.unsplash.com/photo-1506905925346-21bda4d32df4?w=1280",
    "https://images.unsplash.com/photo-1504674900247-0877df9cc836?w=1280",
]

@dataclass
class TestResult:
    phase: str
    test_name: str
    test_type: str
    concurrency: int
    num_requests: int
    successful: int
    failed: int
    success_rate: float
    duration_sec: float
    latency_p50: float
    latency_p95: float
    latency_p99: float
    throughput_total: float
    throughput_output: float
    status: str  # OK, DEGRADED, SATURATED, FAILING


def make_request(base_url, model, prompt, max_tokens, image_urls=None):
    """Make a single API request."""
    start = time.time()
    try:
        if image_urls:
            content = [{"type": "text", "text": prompt}]
            for url in image_urls:
                content.append({"type": "image_url", "image_url": {"url": url}})
            messages = [{"role": "user", "content": content}]
        else:
            messages = [{"role": "user", "content": prompt}]

        resp = requests.post(
            f"{base_url}/v1/chat/completions",
            json={"model": model, "messages": messages, "max_tokens": max_tokens},
            timeout=300
        )
        elapsed = time.time() - start

        if resp.status_code == 200:
            data = resp.json()
            return {
                "success": True,
                "latency": elapsed,
                "prompt_tokens": data.get('usage', {}).get('prompt_tokens', 0),
                "completion_tokens": data.get('usage', {}).get('completion_tokens', 0),
            }
        return {"success": False, "latency": elapsed}
    except Exception:
        return {"success": False, "latency": time.time() - start}


def run_test(base_url, model, phase, test_name, concurrency, num_requests,
             input_tokens, output_tokens, images_per_request=0,
             baseline_p99=None, prev_throughput=None):
    """Run a test with given parameters."""
    # Generate an approximately input_tokens-long prompt
    # ("Explain in detail: " is roughly 4 tokens)
    prompt = "Explain in detail: " * (input_tokens // 4)

    # Run concurrent requests, measuring wall-clock duration
    results = []
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = []
        for _ in range(num_requests):
            img_urls = TEST_IMAGES[:images_per_request] if images_per_request else None
            futures.append(executor.submit(
                make_request, base_url, model, prompt, output_tokens, img_urls
            ))
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
    duration = time.time() - start

    # Calculate metrics
    successful = [r for r in results if r["success"]]
    latencies = sorted(r["latency"] for r in successful)

    def percentile(data, p):
        """Linear-interpolation percentile over sorted data."""
        k = (len(data) - 1) * p / 100
        f = int(k)
        if f == k:
            return data[f]
        c = min(f + 1, len(data) - 1)
        return data[f] + (k - f) * (data[c] - data[f])

    total_tokens = sum(r.get("prompt_tokens", 0) + r.get("completion_tokens", 0) for r in successful)
    throughput = total_tokens / duration if duration > 0 else 0

    p99 = percentile(latencies, 99) if latencies else 0
    success_rate = len(successful) / num_requests if num_requests else 0

    # Determine status
    if success_rate < 0.95:
        status = "FAILING"
    elif baseline_p99 and p99 > baseline_p99 * 2:
        status = "DEGRADED"
    elif prev_throughput and throughput < prev_throughput * 1.05:
        status = "SATURATED"
    else:
        status = "OK"

    return TestResult(
        phase=phase, test_name=test_name, test_type="vision" if images_per_request else "text",
        concurrency=concurrency, num_requests=num_requests,
        successful=len(successful), failed=len(results) - len(successful),
        success_rate=success_rate, duration_sec=duration,
        latency_p50=percentile(latencies, 50) if latencies else 0,
        latency_p95=percentile(latencies, 95) if latencies else 0,
        latency_p99=p99,
        throughput_total=throughput,
        throughput_output=sum(r.get("completion_tokens", 0) for r in successful) / duration,
        status=status
    )


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base-url", default="http://localhost:8000")
    parser.add_argument("--model", required=True)
    parser.add_argument("--vision", action="store_true")
    parser.add_argument("--quick", action="store_true")
    parser.add_argument("--thorough", action="store_true")
    parser.add_argument("--skip-saturation", action="store_true")
    args = parser.parse_args()

    multiplier = 0.5 if args.quick else 3 if args.thorough else 1
    all_results = []

    # Phase 1: Warmup
    print("Phase 1: WARMUP")
    r = run_test(args.base_url, args.model, "WARMUP", "single", 1, 1, 100, 50)
    all_results.append(r)

    # Phase 2: Baseline
    print("Phase 2: BASELINE")
    r = run_test(args.base_url, args.model, "BASELINE", "baseline", 1, int(10*multiplier), 500, 100)
    all_results.append(r)
    baseline_p99 = r.latency_p99

    # Phase 3: Scaling
    print("Phase 3: SCALING")
    prev_throughput = None
    for conc in [5, 10, 25, 50, 75, 100]:
        r = run_test(args.base_url, args.model, "SCALING", f"scale_{conc}",
                     conc, int(conc*2*multiplier), 500, 200,
                     baseline_p99=baseline_p99, prev_throughput=prev_throughput)
        all_results.append(r)
        prev_throughput = r.throughput_total
        if r.status in ["SATURATED", "FAILING"]:
            break

    # Phase 4: Stress Tests
    print("Phase 4: STRESS")
    r = run_test(args.base_url, args.model, "STRESS", "long_output", 50, int(30*multiplier), 200, 500)
    all_results.append(r)
    r = run_test(args.base_url, args.model, "STRESS", "long_context", 25, int(15*multiplier), 4000, 200)
    all_results.append(r)
    if args.vision:
        r = run_test(args.base_url, args.model, "STRESS", "multi_image", 10, int(10*multiplier),
                     100, 200, images_per_request=len(TEST_IMAGES))
        all_results.append(r)

    # Phase 5: Saturation
    if not args.skip_saturation:
        print("Phase 5: SATURATION")
        prev_throughput = None
        for conc in [150, 200, 300, 500]:
            r = run_test(args.base_url, args.model, "SATURATION", f"extreme_{conc}",
                         conc, int(conc*multiplier), 500, 100, prev_throughput=prev_throughput)
            all_results.append(r)
            prev_throughput = r.throughput_total

    # Save results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    RESULTS_DIR.mkdir(exist_ok=True)
    with open(RESULTS_DIR / f"benchmark_{timestamp}.json", "w") as f:
        json.dump({"results": [asdict(r) for r in all_results]}, f, indent=2)

    print(f"\nResults saved to results/benchmark_{timestamp}.json")


if __name__ == "__main__":
    main()

Model Launch Scripts

DeepSeek V3.2 (685B)

#!/bin/bash
docker run --rm \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --env "VLLM_ROCM_USE_AITER=1" \
    --env "AITER_ENABLE_VSKIP=0" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:latest \
    --model deepseek-ai/DeepSeek-V3.2 \
    --tensor-parallel-size 8 \
    --block-size 1 \
    --quantization fp8

Llama 3.1 (405B)

#!/bin/bash
docker run --rm \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:latest \
    --model meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --quantization fp8

Qwen3-VL (235B)

#!/bin/bash
docker run --rm \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:latest \
    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --kv-offloading-backend native \
    --kv-offloading-size 64 \
    --disable-hybrid-kv-cache-manager

Running Benchmarks

Quick Start

bash
# 1. Start vLLM server
./start_deepseek_v32.sh

# 2. Wait for model to load (check logs for "Application startup complete")

# 3. Run validation test (quick)
python unified_benchmark.py --model deepseek-ai/DeepSeek-V3.2 --quick

# 4. Run full stress test (thorough)
python unified_benchmark.py --model deepseek-ai/DeepSeek-V3.2 --thorough
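For step 2, instead of tailing the logs you can poll the server until it responds. This sketch assumes the default port 8000 and the OpenAI-compatible `/v1/models` endpoint served by vLLM:

```python
import time
import requests

def wait_for_server(base_url="http://localhost:8000", timeout=1800):
    """Poll the server until it responds, or raise after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/v1/models", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet; large models can take minutes to load
        time.sleep(10)
    raise TimeoutError(f"Server at {base_url} not ready after {timeout}s")
```

Call `wait_for_server()` before launching the benchmark; the 30-minute default accounts for multi-hundred-GB weight loads.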

Vision Models

bash
# For vision-language models like Qwen3-VL
python unified_benchmark.py \
    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
    --vision \
    --thorough

Understanding Results

Output Files

Each benchmark run produces:

  • results/benchmark_YYYYMMDD_HHMMSS.json - Raw data for analysis
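The raw JSON can be summarized in a few lines. This sketch assumes the schema written by the script above (a top-level `"results"` list of `TestResult` dicts); the file path in the usage note is hypothetical:

```python
import json

def summarize(path):
    """Print one line per test from a saved benchmark JSON file."""
    with open(path) as f:
        results = json.load(f)["results"]
    for r in results:
        print(f'{r["phase"]:<10} {r["test_name"]:<14} '
              f'conc={r["concurrency"]:<4} p99={r["latency_p99"]:.2f}s '
              f'tok/s={r["throughput_total"]:.0f} {r["status"]}')
```

Usage: `summarize("results/benchmark_20260311_120000.json")`.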

Key Metrics

| Metric | Description |
|--------|-------------|
| throughput_total | Total tokens (input + output) per second |
| throughput_output | Output tokens per second (generation speed) |
| latency_p99 | 99th percentile request latency |
| success_rate | Percentage of successful requests |

Interpreting Status

  • All OK: System is healthy, can handle more load
  • DEGRADED in scaling: Normal - latency increases with concurrency
  • SATURATED: Reached maximum throughput, adding load won't help
  • FAILING: Requests are failing, reduce load

Why GPU Utilization Shows ~5%

During high-throughput LLM inference, you may observe low GPU compute utilization (5-10%) in rocm-smi despite excellent throughput. This is expected because:

  • Memory bandwidth bound: LLM inference is limited by memory bandwidth (moving weights), not compute capacity
  • MoE sparse activation: Models like DeepSeek/Qwen activate only 5-10% of parameters per token
  • Efficient batching: vLLM's continuous batching reduces GPU idle time between requests

Monitor rocm-smi --showmeminfo vram and memory bandwidth instead of compute utilization.

Architecture Performance Characteristics

Our testing revealed distinct performance profiles for different architectures:

| Architecture | Example | Strengths | Weaknesses |
|--------------|---------|-----------|------------|
| MoE + GQA | Qwen3-VL | Highest throughput, best scaling | Requires KV offloading workaround |
| Dense + GQA | Llama-405B | Most linear scaling, predictable | Lower absolute throughput |
| MoE + MLA | DeepSeek V3.2 | Memory efficient, good single-request latency | Scaling plateaus earlier, no KV offloading |

When to choose each:

  • Qwen3-VL: Maximum throughput, vision tasks, high concurrency batch processing
  • Llama-405B: Consistent latency requirements, long context workloads
  • DeepSeek V3.2: Reasoning tasks, tool calling, memory-constrained environments

Test Environment

All benchmarks in this documentation were run on:

| Specification | Value |
|---------------|-------|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | 0.14.1 |
