Throughput Saturation Curves

Latency at High Concurrency

Summary
| Model | Concurrency at Peak Throughput | Peak Throughput (tok/s) | p99 Latency at Peak | Throughput Range (min-max) |
|---|---|---|---|---|
| Qwen3-VL-235B-A22B | 700 | 60,131 tok/s | 4.27s | 59,742 – 60,131 tok/s |
| Llama-3.1-405B | 500 | 34,050 tok/s | 7.80s | 33,376 – 34,050 tok/s |
| DeepSeek V3.2 | 500 | 37,413 tok/s | 6.49s | 36,693 – 37,413 tok/s |
| Kimi-K2.5 | 650 | 7,340 tok/s | 35.97s | 7,250 – 7,340 tok/s |
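Each summary row can be derived from the corresponding per-concurrency sweep. The following is a minimal sketch using the DeepSeek V3.2 means and p99 latencies from the per-model table; the function name `summarize_sweep` and the dictionary layout are illustrative assumptions, not part of the benchmark tooling.

```python
# DeepSeek V3.2 sweep, copied from the per-model table:
# concurrency -> (mean throughput in tok/s, p99 latency in seconds)
DEEPSEEK_SWEEP = {
    500: (37413, 6.4915),
    550: (37106, 6.5826),
    600: (36718, 6.6742),
    650: (37157, 6.5508),
    700: (36817, 6.6233),
    750: (36879, 6.5957),
    800: (36825, 6.6307),
    850: (36693, 6.6265),
    900: (36859, 6.5952),
    950: (36911, 6.6158),
    1000: (36697, 6.6644),
}


def summarize_sweep(sweep):
    """Reduce a sweep to the fields shown in one summary row (hypothetical helper)."""
    # Concurrency level with the highest mean throughput.
    peak_concurrency = max(sweep, key=lambda c: sweep[c][0])
    peak_throughput, p99_at_peak = sweep[peak_concurrency]
    throughputs = [tput for tput, _ in sweep.values()]
    return {
        "concurrency_at_peak": peak_concurrency,
        "peak_throughput": peak_throughput,
        "p99_at_peak": round(p99_at_peak, 2),
        "throughput_range": (min(throughputs), max(throughputs)),
    }


summary = summarize_sweep(DEEPSEEK_SWEEP)
print(summary)
```

Running the same reduction over the other three sweeps should reproduce their summary rows as well.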
Key Findings
Saturation Behavior by Architecture
- Qwen3-VL-235B holds 59,700-60,100 tok/s across the entire 500-1000 concurrency range, which suggests that the smallest active parameter footprint (22B) translates into the highest throughput ceiling.
- DeepSeek V3.2 plateaus at 36,700-37,400 tok/s. Mean throughput varies by less than 2% across the sweep, although the CI95 widens to roughly ±1,300-1,500 tok/s at 600 and 1000 concurrent requests. The MoE routing overhead appears to stabilize at high concurrency.
- Llama-3.1-405B sustains 33,400-34,100 tok/s. As a dense model with FP8 quantization, it achieves high total throughput despite its 405B parameter count.
- Kimi-K2.5 stabilizes at 7,250-7,340 tok/s. It is the largest model (1T params), and TP=4 limits the parallelism available to it.
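How flat each plateau is can be quantified as the relative spread between the minimum and maximum mean throughput. This is a small sketch using the ranges from the Summary table; `relative_spread` is a hypothetical helper introduced only for illustration.

```python
# (min, max) mean throughput in tok/s, copied from the Summary table.
THROUGHPUT_RANGE = {
    "Qwen3-VL-235B-A22B": (59742, 60131),
    "Llama-3.1-405B": (33376, 34050),
    "DeepSeek V3.2": (36693, 37413),
    "Kimi-K2.5": (7250, 7340),
}


def relative_spread(lo, hi):
    """Spread of the plateau as a fraction of the peak throughput."""
    return (hi - lo) / hi


spreads = {
    model: relative_spread(lo, hi)
    for model, (lo, hi) in THROUGHPUT_RANGE.items()
}

for model, spread in spreads.items():
    print(f"{model}: {spread:.2%}")
```

With these inputs every model stays within 2% of its own peak across the sweep, and Qwen3-VL-235B-A22B has the tightest plateau.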
Practical Implications
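All four models are already saturated at 500 concurrent requests, the lowest point in this sweep: mean throughput stays within about 2% of its peak through 1000, while p99 latency rises by less than 3%. Operating beyond 750 concurrent requests therefore provides no throughput benefit and only adds tail latency. For production deployments, target the 200-500 range for the best throughput-to-latency tradeoff.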
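One way to make "saturated" concrete is to find the lowest concurrency whose mean throughput is within a tolerance of the peak. The sketch below applies that rule to the Kimi-K2.5 means from the per-model table. The function name `saturation_point` and the 2% default tolerance are assumptions chosen for illustration.

```python
# Kimi-K2.5 mean throughput in tok/s by concurrency, from the per-model table.
KIMI_THROUGHPUT = {
    500: 7316, 550: 7331, 600: 7268, 650: 7340, 700: 7282, 750: 7256,
    800: 7272, 850: 7289, 900: 7250, 950: 7292, 1000: 7260,
}


def saturation_point(sweep, tolerance=0.02):
    """Lowest measured concurrency within `tolerance` of peak throughput."""
    peak = max(sweep.values())
    threshold = peak * (1 - tolerance)
    for concurrency in sorted(sweep):
        if sweep[concurrency] >= threshold:
            return concurrency
    # Unreachable for a non-empty sweep: the peak itself always qualifies.
    return None


print(saturation_point(KIMI_THROUGHPUT))                 # 2% tolerance
print(saturation_point(KIMI_THROUGHPUT, tolerance=0.0))  # exact peak only
```

Because 500 is the lowest concurrency measured here, a result of 500 is only an upper bound on the true saturation point; locating it more precisely would need measurements below 500.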
Per-Model Detail
DeepSeek V3.2
| Concurrency | Throughput Mean ± CI95 | Output Throughput | p99 Latency | p50 Latency |
|---|---|---|---|---|
| 500 | 37,413 ± 82 tok/s | 745 tok/s | 6.4915s | 0.3935s |
| 550 | 37,106 ± 102 tok/s | 744 tok/s | 6.5826s | 0.4376s |
| 600 | 36,718 ± 1,324 tok/s | 812 tok/s | 6.6742s | 0.4216s |
| 650 | 37,157 ± 248 tok/s | 755 tok/s | 6.5508s | 0.4080s |
| 700 | 36,817 ± 709 tok/s | 780 tok/s | 6.6233s | 0.4079s |
| 750 | 36,879 ± 854 tok/s | 771 tok/s | 6.5957s | 0.4150s |
| 800 | 36,825 ± 640 tok/s | 732 tok/s | 6.6307s | 0.4517s |
| 850 | 36,693 ± 512 tok/s | 753 tok/s | 6.6265s | 0.4142s |
| 900 | 36,859 ± 662 tok/s | 665 tok/s | 6.5952s | 0.4241s |
| 950 | 36,911 ± 184 tok/s | 742 tok/s | 6.6158s | 0.4511s |
| 1000 | 36,697 ± 1,476 tok/s | 897 tok/s | 6.6644s | 0.4148s |
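The CI95 half-widths in this table vary widely, so a rough check is useful before reading much into differences between concurrency levels. The sketch below tests whether two confidence intervals overlap, assuming the ± values are symmetric half-widths. Non-overlap is a conservative sign of a real difference; overlap is a heuristic rather than a formal significance test. The helper names are hypothetical.

```python
# (mean throughput, CI95 half-width) in tok/s for selected DeepSeek V3.2 rows.
DEEPSEEK_CI = {
    500: (37413, 82),
    550: (37106, 102),
    1000: (36697, 1476),
}


def interval(mean, half_width):
    """Return the (low, high) bounds of a symmetric confidence interval."""
    return (mean - half_width, mean + half_width)


def intervals_overlap(a, b):
    """True if the confidence intervals of measurements a and b overlap."""
    a_lo, a_hi = interval(*a)
    b_lo, b_hi = interval(*b)
    return a_lo <= b_hi and b_lo <= a_hi


# 500 vs 1000: the wide interval at 1000 covers the mean at 500.
print(intervals_overlap(DEEPSEEK_CI[500], DEEPSEEK_CI[1000]))
# 500 vs 550: both intervals are narrow and do not touch.
print(intervals_overlap(DEEPSEEK_CI[500], DEEPSEEK_CI[550]))
```

By this check, the drop from 500 to 550 is distinguishable from noise, while the 500 versus 1000 comparison is not, because the interval at 1000 is wide.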
Llama-3.1-405B
| Concurrency | Throughput Mean ± CI95 | Output Throughput | p99 Latency | p50 Latency |
|---|---|---|---|---|
| 500 | 34,050 ± 1,149 tok/s | 2,457 tok/s | 7.7987s | 7.7473s |
| 550 | 33,736 ± 642 tok/s | 2,369 tok/s | 7.8475s | 7.8003s |
| 600 | 33,393 ± 26 tok/s | 2,464 tok/s | 7.9579s | 7.9149s |
| 650 | 33,959 ± 1,291 tok/s | 2,452 tok/s | 7.8168s | 7.7641s |
| 700 | 33,398 ± 76 tok/s | 2,457 tok/s | 7.9629s | 7.9076s |
| 750 | 33,442 ± 25 tok/s | 2,464 tok/s | 7.9392s | 7.8950s |
| 800 | 33,406 ± 44 tok/s | 2,469 tok/s | 7.9690s | 7.9060s |
| 850 | 33,376 ± 105 tok/s | 2,437 tok/s | 7.9624s | 7.9059s |
| 900 | 33,385 ± 72 tok/s | 2,464 tok/s | 7.9654s | 7.9142s |
| 950 | 33,440 ± 41 tok/s | 2,457 tok/s | 7.9508s | 7.8954s |
| 1000 | 33,398 ± 15 tok/s | 2,461 tok/s | 7.9597s | 7.9067s |
Qwen3-VL-235B
| Concurrency | Throughput Mean ± CI95 | Output Throughput | p99 Latency | p50 Latency |
|---|---|---|---|---|
| 500 | 60,085 ± 135 tok/s | 4,528 tok/s | 4.2601s | 4.2342s |
| 550 | 59,895 ± 163 tok/s | 4,514 tok/s | 4.2810s | 4.2506s |
| 600 | 60,014 ± 109 tok/s | 4,523 tok/s | 4.2684s | 4.2392s |
| 650 | 59,918 ± 336 tok/s | 4,515 tok/s | 4.2776s | 4.2527s |
| 700 | 60,131 ± 294 tok/s | 4,531 tok/s | 4.2693s | 4.2358s |
| 750 | 59,811 ± 132 tok/s | 4,507 tok/s | 4.2834s | 4.2535s |
| 800 | 60,062 ± 59 tok/s | 4,526 tok/s | 4.2637s | 4.2378s |
| 850 | 59,809 ± 162 tok/s | 4,507 tok/s | 4.2826s | 4.2522s |
| 900 | 59,754 ± 281 tok/s | 4,503 tok/s | 4.2862s | 4.2552s |
| 950 | 60,002 ± 284 tok/s | 4,522 tok/s | 4.2639s | 4.2406s |
| 1000 | 59,742 ± 77 tok/s | 4,502 tok/s | 4.2897s | 4.2595s |
Kimi-K2.5
| Concurrency | Throughput Mean ± CI95 | Output Throughput | p99 Latency | p50 Latency |
|---|---|---|---|---|
| 500 | 7,316 ± 49 tok/s | 552 tok/s | 36.0929s | 35.7474s |
| 550 | 7,331 ± 45 tok/s | 553 tok/s | 35.9924s | 35.6532s |
| 600 | 7,268 ± 51 tok/s | 548 tok/s | 36.3368s | 36.0086s |
| 650 | 7,340 ± 57 tok/s | 554 tok/s | 35.9741s | 35.5474s |
| 700 | 7,282 ± 80 tok/s | 549 tok/s | 36.2510s | 35.8314s |
| 750 | 7,256 ± 92 tok/s | 547 tok/s | 36.3904s | 36.0571s |
| 800 | 7,272 ± 17 tok/s | 548 tok/s | 36.3158s | 36.0500s |
| 850 | 7,289 ± 3 tok/s | 550 tok/s | 36.2254s | 35.7805s |
| 900 | 7,250 ± 16 tok/s | 547 tok/s | 36.4129s | 36.1525s |
| 950 | 7,292 ± 56 tok/s | 550 tok/s | 36.2153s | 35.8572s |
| 1000 | 7,260 ± 94 tok/s | 547 tok/s | 36.3648s | 35.9700s |
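The per-model tables report both a mean throughput and an output throughput. Assuming the mean throughput column counts total tokens (input plus output), which the tables do not state explicitly, the share of output tokens at a concurrency of 500 can be sketched as follows. The dictionary and function names are illustrative.

```python
# Concurrency-500 rows from the four per-model tables:
# model -> (mean throughput, output throughput), both in tok/s.
# Assumption: the mean throughput column is total (input + output) tokens.
ROWS_AT_500 = {
    "DeepSeek V3.2": (37413, 745),
    "Llama-3.1-405B": (34050, 2457),
    "Qwen3-VL-235B": (60085, 4528),
    "Kimi-K2.5": (7316, 552),
}


def output_share(total_tput, output_tput):
    """Fraction of the reported throughput that is output tokens."""
    return output_tput / total_tput


shares = {
    model: output_share(total, output)
    for model, (total, output) in ROWS_AT_500.items()
}

for model, share in shares.items():
    print(f"{model}: {share:.1%}")
```

Under that assumption, three of the models land at roughly 7-8% output tokens, while DeepSeek V3.2 comes out near 2%, so its rows may not be directly comparable with the others.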