GLM-5

Updated on 11 March, 2026

Deploy Zhipu AI's GLM-5 (published under the zai-org organization) on NVIDIA HGX B200 GPUs. This large MoE model introduces Differential Sparse Attention for efficient inference at 744B total parameters.


Model Overview

| Property | Value |
| --- | --- |
| Model ID | zai-org/GLM-5-FP8 |
| Architecture | MoE + Differential Sparse Attention (DSA) |
| Total Parameters | 744B |
| Active Parameters | ~40B per token |
| Attention | Differential Sparse Attention |
| Context Length | 128K tokens |
| Quantization | FP8 (pre-quantized) |
| License | MIT |
| Link | HuggingFace |

Architecture

GLM-5 uses Differential Sparse Attention (DSA), a novel attention mechanism that:

  • Selectively attends to important tokens, discarding low-relevance entries
  • Reduces effective KV cache size by compressing unimportant positions
  • Uses differential scoring to identify which tokens contribute meaningfully to generation

Combined with MoE routing (~40B active of 744B total), GLM-5 balances model capacity with inference efficiency. The architecture is designed for instruction following, reasoning, and code generation.
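GLM-5's actual DSA kernels are not described here, but the core idea of the first two bullets — keep only the highest-scoring keys per query and give the rest zero weight — can be sketched with plain top-k sparse attention. Everything below (function name, `keep` parameter, NumPy implementation) is an illustrative assumption, not GLM-5's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, keep=4):
    """Each query attends only to its `keep` highest-scoring keys; all
    other positions receive exactly zero weight, so their KV entries need
    not be retained. Illustrative only -- not GLM-5's actual DSA kernel."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (Tq, Tk) relevance
    top = np.argpartition(scores, -keep, axis=-1)[:, -keep:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, top, 0.0, axis=-1)          # unmask top-k only
    weights = softmax(scores + mask)                    # zero outside top-k
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 16))               # 8 tokens, dim 16
out, w = topk_sparse_attention(q, k, v, keep=4)
# each of the 8 query rows puts nonzero weight on exactly 4 keys
```

The KV-cache saving follows directly: positions that never receive weight from any future query can be evicted, which is what shrinks the effective cache.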

Implications for NVIDIA HGX B200 deployment:

  1. Full node required: 744B total params in FP8 requires TP=8 (all 8 GPUs)
  2. DSA reduces KV pressure: Sparse attention means less KV cache consumed per token than standard GQA
  3. Memory-bound at decode: With ~89 GB per GPU in weights, only ~64 GB remains for KV cache
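A back-of-envelope check on point 1, assuming ~1 byte per FP8 parameter and even sharding across TP=8 (real deployments add framework overhead, which is why vLLM reports slightly more):

```python
# FP8 stores roughly 1 byte per parameter; weights shard across 8 GPUs.
total_params = 744e9
per_gpu_gib = total_params * 1 / 8 / 2**30
print(f"~{per_gpu_gib:.1f} GiB of weights per GPU")  # ~86.6, close to the 89.43 GiB vLLM reports
```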

Quick Start

```console
$ vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

Or with Docker:

```console
$ docker run --rm --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:v0.16.0 \
  --model zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```
Note: --trust-remote-code is required for the DSA attention implementation.

Configuration

| Flag | Purpose |
| --- | --- |
| --tensor-parallel-size 8 | Full 8-GPU deployment; required for the 744B model |
| --max-model-len 32768 | Context window; the model supports up to 128K |
| --gpu-memory-utilization 0.90 | Let vLLM use up to 90% of VRAM |
| --trust-remote-code | Required for the DSA architecture |

Memory Usage (NVIDIA HGX B200 Verified)

With TP=8 on FP8:

| Component | Per GPU | Total (8 GPUs) |
| --- | --- | --- |
| Model weights | ~89 GB | ~715 GB |
| KV cache (available) | ~64 GB | ~516 GB |
| VRAM used | ~161 GB | ~1,288 GB |
Note: vLLM reported 89.43 GiB for model loading and 64.49 GiB available for KV cache, giving a total capacity of 691,392 cached tokens.
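From those reported numbers you can derive the per-token KV cost, assuming the 691,392-token figure refers to the cache sharded across all 8 GPUs:

```python
kv_gib_per_gpu = 64.49      # vLLM-reported free KV space per GPU
cache_tokens = 691_392      # reported total token capacity
bytes_per_token = kv_gib_per_gpu * 8 * 2**30 / cache_tokens
mib_per_token = bytes_per_token / 2**20
print(f"~{mib_per_token:.2f} MiB of KV cache per token")  # ~0.76 MiB
```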

Performance (NVIDIA HGX B200 Verified)

Benchmark parameters: 2048 input tokens, 512 output tokens, random dataset. TP=8 on 8x NVIDIA HGX B200.

Concurrency Scaling

| Concurrency | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
| --- | --- | --- | --- | --- |
| 1 | 54 | 183 | 18.31 | 18.45 |
| 8 | 317 | 1,249 | 22.85 | 22.20 |
| 16 | 606 | 788 | 24.89 | 24.94 |
| 32 | 884 | 1,341 | 33.56 | 31.86 |
| 64 | 1,456 | 2,031 | 39.94 | 474.93 |
| 128 | 2,132 | 3,558 | 52.93 | 569.76 |
| 256 | 2,071 | 6,207 | 111.12 | 580.49 |
| 512 | 1,955 | 57,202 | 139.19 | 584.22 |
| 1024 | 1,944 | 175,241 | 143.10 | 584.03 |

Peak Performance

| Metric | Value |
| --- | --- |
| Peak sustained throughput | 2,132 tok/s (c=128) |
| Peak burst throughput | 5,114 tok/s |
| Saturation point | ~128 concurrent requests |

Key Observations

  • Early saturation at c=128 — GLM-5 peaks at 2,132 tok/s with 128 concurrent requests, well below MiniMax M2.5's ~512 saturation point. This is driven by the larger active parameter count (40B vs 10B) consuming more memory bandwidth per token.
  • KV cache pressure — With ~89 GB model weights per GPU, only ~64 GB remains for KV cache (691K tokens total). At high concurrency, KV cache becomes the bottleneck.
  • ITL p99 spike at c=64 — Inter-token latency jumps from 32ms to 475ms between c=32 and c=64, indicating the decode phase begins competing for memory bandwidth with prefill.
  • TTFT degradation — TTFT grows dramatically beyond c=128 (3.6s → 57s → 175s), confirming prefill queuing under memory pressure.
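A quick Little's-law check ties the throughput and TPOT columns together: steady-state decode throughput is bounded by concurrency / TPOT, and the measured numbers sit somewhat below that ceiling because TTFT and scheduling overhead also consume wall-clock time.

```python
def decode_ceiling(concurrency: int, tpot_ms: float) -> float:
    """Upper bound on output tok/s if every request were always decoding."""
    return concurrency / (tpot_ms / 1000)

print(round(decode_ceiling(1, 18.31)))    # ~55, vs 54 tok/s measured
print(round(decode_ceiling(128, 52.93)))  # ~2418, vs 2,132 tok/s measured
```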

Test Endpoints

Chat Completion

```console
$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "Explain Differential Sparse Attention"}],
    "max_tokens": 256
  }'
```
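The same request can be issued from Python with only the standard library; this is a minimal sketch against vLLM's OpenAI-compatible endpoint (`build_request` is a hypothetical helper name, endpoint and model as above):

```python
import json
from urllib import request

def build_request(prompt: str, max_tokens: int = 256) -> request.Request:
    """Build the same chat-completions POST the curl example sends."""
    payload = {
        "model": "zai-org/GLM-5-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# with request.urlopen(build_request("Explain Differential Sparse Attention")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```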

NVFP4 Variant

A community-quantized NVFP4 variant is available:

```console
$ vllm serve lukealonso/GLM-5-NVFP4 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
```

NVFP4 could reduce the GPU requirement from TP=8 to TP=4, freeing 4 GPUs for another model. See FP8/NVFP4 Quantization.
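Rough arithmetic behind the TP=4 claim, assuming ~0.5 byte per parameter for NVFP4 including scale factors (the actual checkpoint size will differ):

```python
# 4-bit weights + scales: assume ~0.5 byte/param, sharded across 4 GPUs.
nvfp4_per_gpu_gib = 744e9 * 0.5 / 4 / 2**30
print(f"~{nvfp4_per_gpu_gib:.1f} GiB of weights per GPU at TP=4")  # ~86.6 GiB
```

That is roughly the same per-GPU weight footprint as FP8 at TP=8, so KV-cache headroom per GPU should be comparable under these assumptions.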

Known Issues

  • vLLM 0.16.0+ required — GLM-5's GlmMoeDsaForCausalLM architecture was added in vLLM 0.16.0 (PR #34124). Earlier versions will fail with an "unrecognized architecture" error. Also requires transformers from git main (5.x+).
  • DeepGEMM JIT needs CUTLASS — The FP8 kernels use DeepGEMM, which JIT-compiles CUDA code at first run. Ensure nvcc is on PATH and CUTLASS headers are available in the DeepGEMM package (see Troubleshooting).
  • Long first-run warmup — DeepGEMM warms up ~2,259 JIT kernels on first launch (~5 minutes). Subsequent launches use cached kernels.
  • Large download — GLM-5 FP8 is ~705 GB. Pre-download before deployment.
  • Custom code — Always pass --trust-remote-code for the DSA implementation.
