GLM-5

Updated on 11 March, 2026

Deploy Zhipu AI's GLM-5 (published under the zai-org organization) on NVIDIA HGX B200 GPUs. This large MoE model introduces Differential Sparse Attention for efficient inference at 744B total parameters.


Model Overview

| Property | Value |
| --- | --- |
| Model ID | zai-org/GLM-5-FP8 |
| Architecture | MoE + Differential Sparse Attention (DSA) |
| Total Parameters | 744B |
| Active Parameters | ~40B per token |
| Attention | Differential Sparse Attention |
| Context Length | 128K tokens |
| Quantization | FP8 (pre-quantized) |
| License | MIT |
| Link | HuggingFace |

Architecture

GLM-5 uses Differential Sparse Attention (DSA), a novel attention mechanism that:

  • Selectively attends to important tokens, discarding low-relevance entries
  • Reduces effective KV cache size by compressing unimportant positions
  • Uses differential scoring to identify which tokens contribute meaningfully to generation

Combined with MoE routing (~40B active of 744B total), GLM-5 balances model capacity with inference efficiency. The architecture is designed for instruction following, reasoning, and code generation.
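GLM-5's actual DSA kernels are not described here, but the core idea of the first two bullets — keep only the highest-scoring keys per query and give the rest zero weight — can be sketched with plain top-k sparse attention. Everything below (function name, `keep` parameter, NumPy implementation) is an illustrative assumption, not GLM-5's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, keep=4):
    """Each query attends only to its `keep` highest-scoring keys; all
    other positions receive exactly zero weight, so their KV entries need
    not be retained. Illustrative only -- not GLM-5's actual DSA kernel."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (Tq, Tk) relevance
    top = np.argpartition(scores, -keep, axis=-1)[:, -keep:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, top, 0.0, axis=-1)          # unmask top-k only
    weights = softmax(scores + mask)                    # zero outside top-k
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 16))               # 8 tokens, dim 16
out, w = topk_sparse_attention(q, k, v, keep=4)
# each of the 8 query rows puts nonzero weight on exactly 4 keys
```

The KV-cache saving follows directly: positions that never receive weight from any future query can be evicted, which is what shrinks the effective cache.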

Implications for NVIDIA HGX B200 deployment:

  1. Full node required: 744B total params in FP8 requires TP=8 (all 8 GPUs)
  2. DSA reduces KV pressure: Sparse attention means less KV cache consumed per token than standard GQA
  3. Memory-bound at decode: With ~89 GB per GPU in weights, only ~64 GB remains for KV cache
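A back-of-envelope check on point 1, assuming ~1 byte per FP8 parameter and even sharding across TP=8 (real deployments add framework overhead, which is why vLLM reports slightly more):

```python
# FP8 stores roughly 1 byte per parameter; weights shard across 8 GPUs.
total_params = 744e9
per_gpu_gib = total_params * 1 / 8 / 2**30
print(f"~{per_gpu_gib:.1f} GiB of weights per GPU")  # ~86.6, close to the 89.43 GiB vLLM reports
```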

Quick Start

```console
$ vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

Or with Docker:

```console
$ docker run --rm --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.16.0 \
  --model zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```
Note: --trust-remote-code is required for the DSA attention implementation.

Configuration

| Flag | Purpose |
| --- | --- |
| --tensor-parallel-size 8 | Full 8-GPU deployment; required for the 744B model |
| --max-model-len 32768 | Context window; the model supports up to 128K |
| --gpu-memory-utilization 0.90 | Let vLLM use up to 90% of VRAM |
| --trust-remote-code | Required for the DSA architecture |

Memory Usage (NVIDIA HGX B200 Verified)

With TP=8 on FP8:

| Component | Per GPU | Total (8 GPUs) |
| --- | --- | --- |
| Model weights | ~89 GB | ~715 GB |
| KV cache (available) | ~64 GB | ~516 GB |
| VRAM used | ~161 GB | ~1,288 GB |
Note: vLLM reported 89.43 GiB for model loading and 64.49 GiB available for KV cache, giving a total capacity of 691,392 cached tokens.
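From those reported numbers you can derive the per-token KV cost, assuming the 691,392-token figure refers to the cache sharded across all 8 GPUs:

```python
kv_gib_per_gpu = 64.49      # vLLM-reported free KV space per GPU
cache_tokens = 691_392      # reported total token capacity
bytes_per_token = kv_gib_per_gpu * 8 * 2**30 / cache_tokens
mib_per_token = bytes_per_token / 2**20
print(f"~{mib_per_token:.2f} MiB of KV cache per token")  # ~0.76 MiB
```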

Performance (NVIDIA HGX B200 Verified)

Benchmark parameters: 2048 input tokens, 512 output tokens, random dataset. TP=8 on 8x NVIDIA HGX B200.

Concurrency Scaling

| Concurrency | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
| --- | --- | --- | --- | --- |
| 1 | 54 | 183 | 18.31 | 18.45 |
| 8 | 317 | 1,249 | 22.85 | 22.20 |
| 16 | 606 | 788 | 24.89 | 24.94 |
| 32 | 884 | 1,341 | 33.56 | 31.86 |
| 64 | 1,456 | 2,031 | 39.94 | 474.93 |
| 128 | 2,132 | 3,558 | 52.93 | 569.76 |
| 256 | 2,071 | 6,207 | 111.12 | 580.49 |
| 512 | 1,955 | 57,202 | 139.19 | 584.22 |
| 1024 | 1,944 | 175,241 | 143.10 | 584.03 |

Peak Performance

| Metric | Value |
| --- | --- |
| Peak sustained throughput | 2,132 tok/s (c=128) |
| Peak burst throughput | 5,114 tok/s |
| Saturation point | ~128 concurrent requests |

Key Observations

  • Early saturation at c=128 — GLM-5 peaks at 2,132 tok/s with 128 concurrent requests, well below MiniMax M2.5's ~512 saturation point. This is driven by the larger active parameter count (40B vs 10B) consuming more memory bandwidth per token.
  • KV cache pressure — With ~89 GB model weights per GPU, only ~64 GB remains for KV cache (691K tokens total). At high concurrency, KV cache becomes the bottleneck.
  • ITL p99 spike at c=64 — Inter-token latency jumps from 32ms to 475ms between c=32 and c=64, indicating the decode phase begins competing for memory bandwidth with prefill.
  • TTFT degradation — TTFT grows dramatically beyond c=128 (3.6s → 57s → 175s), confirming prefill queuing under memory pressure.
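A quick Little's-law check ties the throughput and TPOT columns together: steady-state decode throughput is bounded by concurrency / TPOT, and the measured numbers sit somewhat below that ceiling because TTFT and scheduling overhead also consume wall-clock time.

```python
def decode_ceiling(concurrency: int, tpot_ms: float) -> float:
    """Upper bound on output tok/s if every request were always decoding."""
    return concurrency / (tpot_ms / 1000)

print(round(decode_ceiling(1, 18.31)))    # ~55, vs 54 tok/s measured
print(round(decode_ceiling(128, 52.93)))  # ~2418, vs 2,132 tok/s measured
```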

Test Endpoints

Chat Completion

```console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "Explain Differential Sparse Attention"}],
    "max_tokens": 256
  }'
```
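The same request can be issued from Python with only the standard library; this is a minimal sketch against vLLM's OpenAI-compatible endpoint (`build_request` is a hypothetical helper name, endpoint and model as above):

```python
import json
from urllib import request

def build_request(prompt: str, max_tokens: int = 256) -> request.Request:
    """Build the same chat-completions POST the curl example sends."""
    payload = {
        "model": "zai-org/GLM-5-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# with request.urlopen(build_request("Explain Differential Sparse Attention")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```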

NVFP4 Variant

A community-quantized NVFP4 variant is available:

```console
$ vllm serve lukealonso/GLM-5-NVFP4 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

NVFP4 could reduce the GPU requirement from TP=8 to TP=4, freeing 4 GPUs for another model. See FP8/NVFP4 Quantization.
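Rough arithmetic behind the TP=4 claim, assuming ~0.5 byte per parameter for NVFP4 including scale factors (the actual checkpoint size will differ):

```python
# 4-bit weights + scales: assume ~0.5 byte/param, sharded across 4 GPUs.
nvfp4_per_gpu_gib = 744e9 * 0.5 / 4 / 2**30
print(f"~{nvfp4_per_gpu_gib:.1f} GiB of weights per GPU at TP=4")  # ~86.6 GiB
```

That is roughly the same per-GPU weight footprint as FP8 at TP=8, so KV-cache headroom per GPU should be comparable under these assumptions.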

Known Issues

  • vLLM 0.16.0+ required — GLM-5's GlmMoeDsaForCausalLM architecture was added in vLLM 0.16.0 (PR #34124). Earlier versions will fail with an "unrecognized architecture" error. Also requires transformers from git main (5.x+).
  • DeepGEMM JIT needs CUTLASS — The FP8 kernels use DeepGEMM, which JIT-compiles CUDA code at first run. Ensure nvcc is on PATH and CUTLASS headers are available in the DeepGEMM package (see Troubleshooting).
  • Long first-run warmup — DeepGEMM warms up ~2,259 JIT kernels on first launch (~5 minutes). Subsequent launches use cached kernels.
  • Large download — GLM-5 FP8 is ~705 GB. Pre-download before deployment.
  • Custom code — Always pass --trust-remote-code for the DSA implementation.
