Benchmarking Guide

Updated on 11 March, 2026

This guide explains the methodology behind all benchmark results in this documentation and provides the scripts to reproduce them.


Methodology

Our unified benchmark follows a progressive approach to stress testing:

WARMUP → BASELINE → SCALING → STRESS → SATURATION

Phase Overview

| Phase | Purpose | What It Measures |
|-------|---------|------------------|
| WARMUP | Initialize model, warm caches | Single-request baseline |
| BASELINE | Establish clean metrics | p99 latency at concurrency=1 |
| SCALING | Find optimal concurrency | Throughput vs. latency trade-off |
| STRESS | Test edge cases | Long context, long output, multi-image |
| SATURATION | Find breaking point | Maximum sustainable load |

Status Indicators

| Status | Meaning | Threshold |
|--------|---------|-----------|
| OK | Normal operation | Success rate ≥ 95%, latency acceptable |
| DEGRADED | Latency increased | p99 > 2x baseline p99 |
| SATURATED | Throughput plateau | Throughput < 1.05x previous level |
| FAILING | Requests failing | Success rate < 95% |

Note: DEGRADED status is expected under concurrent load; it reflects the latency cost of higher throughput, not a problem.
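The thresholds above can be expressed as a small standalone classifier. This is an illustrative sketch (the `classify_status` name is ours, not part of the benchmark script), checked in the same order of precedence:

```python
def classify_status(success_rate, p99, baseline_p99=None, throughput=None,
                    prev_throughput=None):
    """Map raw metrics to a status label using the thresholds above."""
    if success_rate < 0.95:
        return "FAILING"      # success rate below 95%
    if baseline_p99 and p99 > baseline_p99 * 2:
        return "DEGRADED"     # p99 more than 2x the baseline p99
    if prev_throughput and throughput is not None and throughput < prev_throughput * 1.05:
        return "SATURATED"    # throughput gained less than 5% over the previous level
    return "OK"

print(classify_status(0.99, 1.2, baseline_p99=1.0))  # → "OK" (latency within 2x baseline)
```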

Test Modes

| Mode | Multiplier | Use Case |
|------|------------|----------|
| --quick | 0.5x | Fast validation, CI/CD |
| (default) | 1x | Standard benchmarking |
| --thorough | 3x | Production capacity planning |
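The multiplier simply scales each phase's base request count. A minimal sketch (the `request_count` helper is illustrative, not part of the script):

```python
def request_count(base, quick=False, thorough=False):
    """Scale a phase's base request count by the test-mode multiplier."""
    multiplier = 0.5 if quick else 3 if thorough else 1  # --quick / --thorough
    return int(base * multiplier)

print(request_count(10, quick=True))  # → 5 requests in --quick mode
```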

Benchmark Script

The unified benchmark script runs all phases automatically:

python
#!/usr/bin/env python3
"""
Unified Progressive Benchmark for vLLM

Usage:
  # Quick validation test
  python unified_benchmark.py --model deepseek-ai/DeepSeek-V3.2 --quick

  # Standard benchmark
  python unified_benchmark.py --model meta-llama/Llama-3.1-405B-Instruct

  # Thorough stress test
  python unified_benchmark.py --model Qwen/Qwen3-VL-235B-A22B-Instruct --vision --thorough

Options:
  --model         Model name (required)
  --vision        Enable vision tests for VLMs
  --quick         0.5x requests (fast validation)
  --thorough      3x requests (comprehensive stress test)
  --skip-saturation   Skip extreme load tests
"""

import argparse
import json
import requests
import time
import concurrent.futures
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass, asdict

RESULTS_DIR = Path("./results")

TEST_IMAGES = [
    "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=1280",
    "https://images.unsplash.com/photo-1506905925346-21bda4d32df4?w=1280",
    "https://images.unsplash.com/photo-1504674900247-0877df9cc836?w=1280",
]

@dataclass
class TestResult:
    phase: str
    test_name: str
    test_type: str
    concurrency: int
    num_requests: int
    successful: int
    failed: int
    success_rate: float
    duration_sec: float
    latency_p50: float
    latency_p95: float
    latency_p99: float
    throughput_total: float
    throughput_output: float
    status: str  # OK, DEGRADED, SATURATED, FAILING


def make_request(base_url, model, prompt, max_tokens, image_urls=None):
    """Make a single API request."""
    start = time.time()
    try:
        if image_urls:
            content = [{"type": "text", "text": prompt}]
            for url in image_urls:
                content.append({"type": "image_url", "image_url": {"url": url}})
            messages = [{"role": "user", "content": content}]
        else:
            messages = [{"role": "user", "content": prompt}]

        resp = requests.post(
            f"{base_url}/v1/chat/completions",
            json={"model": model, "messages": messages, "max_tokens": max_tokens},
            timeout=300
        )
        elapsed = time.time() - start

        if resp.status_code == 200:
            data = resp.json()
            return {
                "success": True,
                "latency": elapsed,
                "prompt_tokens": data.get('usage', {}).get('prompt_tokens', 0),
                "completion_tokens": data.get('usage', {}).get('completion_tokens', 0),
            }
        return {"success": False, "latency": elapsed}
    except Exception:
        return {"success": False, "latency": time.time() - start}


def run_test(base_url, model, phase, test_name, concurrency, num_requests,
             input_tokens, output_tokens, images_per_request=0,
             baseline_p99=None, prev_throughput=None):
    """Run a test with given parameters."""
    # Generate an approximately input_tokens-long prompt
    # ("Explain in detail: " is roughly 4 tokens)
    prompt = "Explain in detail: " * (input_tokens // 4)

    # Run concurrent requests, measuring wall-clock duration
    results = []
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = []
        for _ in range(num_requests):
            img_urls = TEST_IMAGES[:images_per_request] if images_per_request else None
            futures.append(executor.submit(
                make_request, base_url, model, prompt, output_tokens, img_urls
            ))
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
    duration = time.time() - start

    # Calculate metrics
    successful = [r for r in results if r["success"]]
    latencies = sorted(r["latency"] for r in successful)

    def percentile(data, p):
        """Linear-interpolation percentile over sorted data."""
        k = (len(data) - 1) * p / 100
        f = int(k)
        if f == k:
            return data[f]
        c = min(f + 1, len(data) - 1)
        return data[f] + (k - f) * (data[c] - data[f])

    total_tokens = sum(r.get("prompt_tokens", 0) + r.get("completion_tokens", 0) for r in successful)
    throughput = total_tokens / duration if duration > 0 else 0

    p99 = percentile(latencies, 99) if latencies else 0
    success_rate = len(successful) / num_requests if num_requests else 0

    # Determine status
    if success_rate < 0.95:
        status = "FAILING"
    elif baseline_p99 and p99 > baseline_p99 * 2:
        status = "DEGRADED"
    elif prev_throughput and throughput < prev_throughput * 1.05:
        status = "SATURATED"
    else:
        status = "OK"

    return TestResult(
        phase=phase, test_name=test_name, test_type="vision" if images_per_request else "text",
        concurrency=concurrency, num_requests=num_requests,
        successful=len(successful), failed=len(results) - len(successful),
        success_rate=success_rate, duration_sec=duration,
        latency_p50=percentile(latencies, 50) if latencies else 0,
        latency_p95=percentile(latencies, 95) if latencies else 0,
        latency_p99=p99,
        throughput_total=throughput,
        throughput_output=sum(r.get("completion_tokens", 0) for r in successful) / duration,
        status=status
    )


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base-url", default="http://localhost:8000")
    parser.add_argument("--model", required=True)
    parser.add_argument("--vision", action="store_true")
    parser.add_argument("--quick", action="store_true")
    parser.add_argument("--thorough", action="store_true")
    parser.add_argument("--skip-saturation", action="store_true")
    args = parser.parse_args()

    multiplier = 0.5 if args.quick else 3 if args.thorough else 1
    all_results = []

    # Phase 1: Warmup
    print("Phase 1: WARMUP")
    r = run_test(args.base_url, args.model, "WARMUP", "single", 1, 1, 100, 50)
    all_results.append(r)

    # Phase 2: Baseline
    print("Phase 2: BASELINE")
    r = run_test(args.base_url, args.model, "BASELINE", "baseline", 1, int(10*multiplier), 500, 100)
    all_results.append(r)
    baseline_p99 = r.latency_p99

    # Phase 3: Scaling
    print("Phase 3: SCALING")
    prev_throughput = None
    for conc in [5, 10, 25, 50, 75, 100]:
        r = run_test(args.base_url, args.model, "SCALING", f"scale_{conc}",
                     conc, int(conc*2*multiplier), 500, 200,
                     baseline_p99=baseline_p99, prev_throughput=prev_throughput)
        all_results.append(r)
        prev_throughput = r.throughput_total
        if r.status in ["SATURATED", "FAILING"]:
            break

    # Phase 4: Stress Tests
    print("Phase 4: STRESS")
    r = run_test(args.base_url, args.model, "STRESS", "long_output", 50, int(30*multiplier), 200, 500)
    all_results.append(r)
    r = run_test(args.base_url, args.model, "STRESS", "long_context", 25, int(15*multiplier), 4000, 200)
    all_results.append(r)
    if args.vision:
        r = run_test(args.base_url, args.model, "STRESS", "multi_image", 10, int(10*multiplier),
                     100, 200, images_per_request=len(TEST_IMAGES))
        all_results.append(r)

    # Phase 5: Saturation
    if not args.skip_saturation:
        print("Phase 5: SATURATION")
        prev_throughput = None
        for conc in [150, 200, 300, 500]:
            r = run_test(args.base_url, args.model, "SATURATION", f"extreme_{conc}",
                         conc, int(conc*multiplier), 500, 100, prev_throughput=prev_throughput)
            all_results.append(r)
            prev_throughput = r.throughput_total

    # Save results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    RESULTS_DIR.mkdir(exist_ok=True)
    with open(RESULTS_DIR / f"benchmark_{timestamp}.json", "w") as f:
        json.dump({"results": [asdict(r) for r in all_results]}, f, indent=2)

    print(f"\nResults saved to results/benchmark_{timestamp}.json")


if __name__ == "__main__":
    main()

Model Launch Scripts

DeepSeek V3.2 (685B)

#!/bin/bash
docker run --rm \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --env "VLLM_ROCM_USE_AITER=1" \
    --env "AITER_ENABLE_VSKIP=0" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:latest \
    --model deepseek-ai/DeepSeek-V3.2 \
    --tensor-parallel-size 8 \
    --block-size 1 \
    --quantization fp8

Llama 3.1 (405B)

#!/bin/bash
docker run --rm \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:latest \
    --model meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --quantization fp8

Qwen3-VL (235B)

#!/bin/bash
docker run --rm \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --env "VLLM_USE_TRITON_FLASH_ATTN=0" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:latest \
    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --kv-offloading-backend native \
    --kv-offloading-size 64 \
    --disable-hybrid-kv-cache-manager

Running Benchmarks

Quick Start

bash
# 1. Start vLLM server
./start_deepseek_v32.sh

# 2. Wait for model to load (check logs for "Application startup complete")

# 3. Run validation test (quick)
python unified_benchmark.py --model deepseek-ai/DeepSeek-V3.2 --quick

# 4. Run full stress test (thorough)
python unified_benchmark.py --model deepseek-ai/DeepSeek-V3.2 --thorough
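For step 2, instead of tailing the logs you can poll the server until it responds. This sketch assumes the default port 8000 and the OpenAI-compatible `/v1/models` endpoint served by vLLM:

```python
import time
import requests

def wait_for_server(base_url="http://localhost:8000", timeout=1800):
    """Poll the server until it responds, or raise after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/v1/models", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet; large models can take minutes to load
        time.sleep(10)
    raise TimeoutError(f"Server at {base_url} not ready after {timeout}s")
```

Call `wait_for_server()` before launching the benchmark; the 30-minute default accounts for multi-hundred-GB weight loads.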

Vision Models

bash
# For vision-language models like Qwen3-VL
python unified_benchmark.py \
    --model Qwen/Qwen3-VL-235B-A22B-Instruct \
    --vision \
    --thorough

Understanding Results

Output Files

Each benchmark run produces:

  • results/benchmark_YYYYMMDD_HHMMSS.json - Raw data for analysis
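The raw JSON can be summarized in a few lines. This sketch assumes the schema written by the script above (a top-level `"results"` list of `TestResult` dicts); the file path in the usage note is hypothetical:

```python
import json

def summarize(path):
    """Print one line per test from a saved benchmark JSON file."""
    with open(path) as f:
        results = json.load(f)["results"]
    for r in results:
        print(f'{r["phase"]:<10} {r["test_name"]:<14} '
              f'conc={r["concurrency"]:<4} p99={r["latency_p99"]:.2f}s '
              f'tok/s={r["throughput_total"]:.0f} {r["status"]}')
```

Usage: `summarize("results/benchmark_20260311_120000.json")`.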

Key Metrics

| Metric | Description |
|--------|-------------|
| throughput_total | Total tokens (input + output) per second |
| throughput_output | Output tokens per second (generation speed) |
| latency_p99 | 99th percentile request latency |
| success_rate | Percentage of successful requests |

Interpreting Status

  • All OK: System is healthy, can handle more load
  • DEGRADED in scaling: Normal - latency increases with concurrency
  • SATURATED: Reached maximum throughput, adding load won't help
  • FAILING: Requests are failing, reduce load

Why GPU Utilization Shows ~5%

During high-throughput LLM inference, you may observe low GPU compute utilization (5-10%) in rocm-smi despite excellent throughput. This is expected because:

  • Memory bandwidth bound: LLM inference is limited by memory bandwidth (moving weights), not compute capacity
  • MoE sparse activation: Models like DeepSeek/Qwen activate only 5-10% of parameters per token
  • Efficient batching: vLLM's continuous batching reduces GPU idle time between requests

Monitor rocm-smi --showmeminfo vram and memory bandwidth instead of compute utilization.

Architecture Performance Characteristics

Our testing revealed distinct performance profiles for different architectures:

| Architecture | Example | Strengths | Weaknesses |
|--------------|---------|-----------|------------|
| MoE + GQA | Qwen3-VL | Highest throughput, best scaling | Requires KV offloading workaround |
| Dense + GQA | Llama-405B | Most linear scaling, predictable | Lower absolute throughput |
| MoE + MLA | DeepSeek V3.2 | Memory efficient, good single-request latency | Scaling plateaus earlier, no KV offloading |

When to choose each:

  • Qwen3-VL: Maximum throughput, vision tasks, high concurrency batch processing
  • Llama-405B: Consistent latency requirements, long context workloads
  • DeepSeek V3.2: Reasoning tasks, tool calling, memory-constrained environments

Test Environment

All benchmarks in this documentation were run on:

| Specification | Value |
|---------------|-------|
| GPU | 8x AMD Instinct MI325X |
| VRAM | 256 GB HBM3E per GPU (2 TB total) |
| Architecture | CDNA 3 (gfx942) |
| ROCm | 6.4.2-120 |
| vLLM | 0.14.1 |
