Deploy Zhipu AI's GLM-5 on NVIDIA HGX B200 GPUs. This large mixture-of-experts (MoE) model introduces Differential Sparse Attention (DSA) for efficient inference at 744B total parameters.
| Property | Value |
|---|---|
| Model ID | zai-org/GLM-5-FP8 |
| Architecture | MoE + Differential Sparse Attention (DSA) |
| Total Parameters | 744B |
| Active Parameters | ~40B per token |
| Attention | Differential Sparse Attention |
| Context Length | 128K tokens |
| Quantization | FP8 (pre-quantized) |
| License | MIT |
| Link | HuggingFace |
GLM-5 uses Differential Sparse Attention (DSA), a novel attention mechanism that sparsifies the attention computation to keep long-context inference tractable. Combined with MoE routing (~40B active of 744B total parameters), GLM-5 balances model capacity with inference efficiency. The architecture is designed for instruction following, reasoning, and code generation.
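The core idea of MoE routing can be sketched in a few lines: a gating network scores all experts, only the top-k run, so per-token compute scales with k rather than with the expert count. The expert count, hidden size, and top-k below are illustrative toy values, not GLM-5's actual configuration.

```python
import numpy as np

def moe_route(token, experts, gate_w, top_k=2):
    """Route one token through its top-k experts (toy sketch, not GLM-5's real router)."""
    logits = gate_w @ token                  # one gating logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over experts
    top = np.argsort(probs)[-top_k:]         # indices of the k highest-scoring experts
    weights = probs[top] / probs[top].sum()  # renormalize over selected experts
    # Only the selected experts execute, so compute scales with top_k, not len(experts)
    return sum(w * experts[i](token) for i, w in zip(top, weights))

# Toy demo: 8 linear "experts", route a 16-dim token through the top 2
rng = np.random.default_rng(0)
d = 16
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(8)]
gate_w = rng.normal(size=(8, d))
out = moe_route(rng.normal(size=d), experts, gate_w)
print(out.shape)
```

This is why "~40B active of 744B total" is possible: most expert weights sit idle for any given token.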
To deploy on an NVIDIA HGX B200 node, serve the model with vLLM:
$ vllm serve zai-org/GLM-5-FP8 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--trust-remote-code
Or with Docker:
$ docker run --rm --gpus all --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:v0.16.0 \
--model zai-org/GLM-5-FP8 \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--trust-remote-code
--trust-remote-code is required for the DSA attention implementation.
| Flag | Purpose |
|---|---|
| --tensor-parallel-size 8 | Full 8-GPU deployment required for the 744B model |
| --max-model-len 32768 | Context window; the model supports up to 128K |
| --gpu-memory-utilization 0.90 | Reserve 90% of VRAM |
| --trust-remote-code | Required for the DSA architecture |
Approximate memory footprint with TP=8 on FP8:
| Component | Per GPU | Total (8 GPUs) |
|---|---|---|
| Model weights | ~89 GB | ~715 GB |
| KV cache (available) | ~64 GB | ~516 GB |
| VRAM used | ~161 GB | ~1,288 GB |
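The table's figures can be roughly reproduced with back-of-envelope arithmetic. Assumptions not taken from the table: FP8 weights at ~1 byte per parameter, and ~180 GB of HBM per B200 GPU; the small gap versus the measured numbers is runtime overhead (activations, CUDA graphs) that this sketch ignores.

```python
# Back-of-envelope VRAM budget for GLM-5-FP8 at TP=8 (decimal GB).
params = 744e9
weights_per_gpu_gb = params * 1.0 / 8 / 1e9       # FP8 ~1 byte/param, sharded over 8 GPUs
budget_per_gpu_gb = 180 * 0.90                    # --gpu-memory-utilization 0.90 of ~180 GB HBM
kv_per_gpu_gb = budget_per_gpu_gb - weights_per_gpu_gb  # remainder: KV cache + overhead
print(weights_per_gpu_gb, budget_per_gpu_gb, kv_per_gpu_gb)
```

This lands near the table's ~89 GB weights, ~161 GB used, and ~64 GB KV cache per GPU.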
Benchmark parameters: 2048 input tokens, 512 output tokens, random dataset. TP=8 on 8x NVIDIA HGX B200.
| Concurrent | Output tok/s | TTFT (ms) | TPOT (ms) | ITL p99 (ms) |
|---|---|---|---|---|
| 1 | 54 | 183 | 18.31 | 18.45 |
| 8 | 317 | 1,249 | 22.85 | 22.20 |
| 16 | 606 | 788 | 24.89 | 24.94 |
| 32 | 884 | 1,341 | 33.56 | 31.86 |
| 64 | 1,456 | 2,031 | 39.94 | 474.93 |
| 128 | 2,132 | 3,558 | 52.93 | 569.76 |
| 256 | 2,071 | 6,207 | 111.12 | 580.49 |
| 512 | 1,955 | 57,202 | 139.19 | 584.22 |
| 1024 | 1,944 | 175,241 | 143.10 | 584.03 |
| Metric | Value |
|---|---|
| Peak sustained throughput | 2,132 tok/s (c=128) |
| Peak burst throughput | 5,114 tok/s |
| Saturation point | ~128 concurrent |
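The benchmark numbers can be sanity-checked against each other: in steady-state decoding, each request produces roughly 1000 / TPOT_ms tokens per second, so aggregate throughput should track concurrency × that rate until the server saturates. The rows below are copied from the table above; the estimate drifts high at large concurrency because it ignores scheduling gaps and prefill time.

```python
# Cross-check measured throughput against 1000/TPOT * concurrency (values from the table).
rows = {1: (54, 18.31), 16: (606, 24.89), 128: (2132, 52.93)}  # conc: (tok/s, TPOT ms)
for conc, (measured, tpot_ms) in rows.items():
    estimated = conc * 1000 / tpot_ms
    print(f"c={conc:>3}: measured {measured} tok/s, estimated {estimated:.0f} tok/s")
```

At c=1 the estimate matches almost exactly; beyond the ~128-request saturation point throughput flattens while TTFT grows, which is why the estimate diverges there.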
$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "zai-org/GLM-5-FP8",
"messages": [{"role": "user", "content": "Explain Differential Sparse Attention"}],
"max_tokens": 256
}'
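The same request can be issued from Python's standard library against vLLM's OpenAI-compatible endpoint; no extra packages are needed. The send is commented out so the snippet stands alone when the server isn't running.

```python
import json
import urllib.request

# Identical payload to the curl example above.
payload = {
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "Explain Differential Sparse Attention"}],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the vLLM server from the deployment section is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client SDK pointed at http://localhost:8000/v1 works the same way.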
A community-quantized NVFP4 variant is available:
$ vllm serve lukealonso/GLM-5-NVFP4 \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--trust-remote-code
NVFP4 could reduce the GPU requirement from TP=8 to TP=4, freeing 4 GPUs for another model. See FP8/NVFP4 Quantization.
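The TP=4 claim follows from simple arithmetic: NVFP4 stores roughly 0.5 byte per parameter versus ~1 byte for FP8, halving the weight footprint. This sketch ignores quantization scale tensors and runtime overhead, and assumes ~180 GB HBM per B200 GPU.

```python
# Weight footprint per GPU: FP8 at TP=8 vs NVFP4 at TP=4 (decimal GB).
params = 744e9
fp8_per_gpu_gb = params * 1.0 / 8 / 1e9    # ~1 byte/param over 8 GPUs
nvfp4_per_gpu_gb = params * 0.5 / 4 / 1e9  # ~0.5 byte/param over 4 GPUs
print(fp8_per_gpu_gb, nvfp4_per_gpu_gb)
```

Both land at a comparable per-GPU weight footprint, which is why halving the precision lets you halve the GPU count while staying inside the 0.90 memory-utilization budget.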
The GlmMoeDsaForCausalLM architecture was added in vLLM 0.16.0 (PR #34124); earlier versions fail with an "unrecognized architecture" error. Deployment also requires:
- transformers installed from git main (5.x+)
- nvcc on PATH and CUTLASS headers available in the DeepGEMM package (see Troubleshooting)
- --trust-remote-code for the DSA implementation