
GLM 5 FP8

GLM-5-FP8 is a large Mixture-of-Experts (MoE) language model developed by Z.ai for complex reasoning, coding, and long-horizon agentic workflows. The model features a 744B parameter architecture with 40B activated parameters, built on a 78-layer transformer with 64 attention heads and a 6,144 hidden size, utilizing 256 routed experts with 8 experts activated per token. It integrates DeepSeek Sparse Attention (DSA) to reduce deployment cost while maintaining performance, and supports a ~202K token context window for large-scale multi-step reasoning tasks.
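For intuition, here is a minimal sketch (not Z.ai's implementation) of the top-8-of-256 expert routing described above; the router weights and token vector are random stand-ins for illustration only.

PYTHON
# Illustrative sketch of MoE routing: 256 routed experts, 8 activated per token.
# The router weights and hidden vector below are random placeholders.
import numpy as np

NUM_EXPERTS, TOP_K, HIDDEN = 256, 8, 6144   # figures from the model card

def route(token_hidden: np.ndarray, router_w: np.ndarray):
    """Pick the top-8 experts for one token and return their gate weights."""
    logits = router_w @ token_hidden                 # (256,) score per expert
    top = np.argsort(logits)[-TOP_K:]                # indices of the 8 best experts
    weights = np.exp(logits[top] - logits[top].max())
    return top, weights / weights.sum()              # normalized gate weights

rng = np.random.default_rng(0)
experts, gates = route(rng.standard_normal(HIDDEN),
                       rng.standard_normal((NUM_EXPERTS, HIDDEN)))
print(experts, gates.round(3))

Only the selected 8 experts run for each token, which is why the per-token activated parameter count (40B) is far below the 744B total.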
Type: MoE LLM
Capabilities: Text Generation, Instruction Following, Reasoning, Mathematical Reasoning, and more
Release Date: February 11, 2026
License: MIT

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE
docker run -it --rm \
  --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --shm-size=128g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_FLOAT32_MATMUL_PRECISION=high \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e HF_TOKEN='YOUR_HF_TOKEN' \
  -e LD_LIBRARY_PATH='/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu' \
  vllm/vllm-openai:glm5 \
  zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len auto \
  --gpu-memory-utilization 0.90 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --max-num-seqs 1024 \
  --disable-log-requests \
  --trust-remote-code
Note

Because this model's head size (576) and Sparse MLA attention have no compatible attention backend for FP8 KV cache, the --kv-cache-dtype fp8 flag is currently unsupported for this configuration.
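
Once the container is running, vLLM exposes an OpenAI-compatible API on port 8000. The following is a minimal client sketch, assuming the server runs locally and the openai Python package is installed; the prompt is illustrative.

PYTHON
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-5-FP8",
    messages=[{"role": "user",
               "content": "Summarize DeepSeek Sparse Attention in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)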

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.

Charts: ITL vs Concurrency; Time to First Token; Throughput Scaling; Total Tokens/sec vs Avg TTFT
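
For reference, the sketch below shows how this kind of load test can be approximated against the endpoint above. It is not the exact harness behind these charts; the URL, model name, prompt, token budget, and concurrency levels are assumptions. It streams completions, records time to first token per request, and reports aggregate tokens/sec at each concurrency level.

PYTHON
# Minimal concurrency load test against the local OpenAI-compatible endpoint.
# Counts one token per streamed chunk, which is an approximation.
import asyncio, time
import aiohttp

URL = "http://localhost:8000/v1/completions"     # assumed local vLLM server
MODEL = "zai-org/GLM-5-FP8"
PROMPT = "Explain sparse attention in one paragraph."   # fixed input per request

async def one_request(session: aiohttp.ClientSession):
    payload = {"model": MODEL, "prompt": PROMPT, "max_tokens": 128, "stream": True}
    start = time.perf_counter()
    ttft, tokens = None, 0
    async with session.post(URL, json=payload) as resp:
        async for raw in resp.content:               # SSE lines: "data: {...}"
            line = raw.decode().strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            if ttft is None:
                ttft = time.perf_counter() - start   # time to first token
            tokens += 1
    return ttft, tokens, time.perf_counter() - start

async def run(concurrency: int):
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*[one_request(session) for _ in range(concurrency)])
    total_tokens = sum(t for _, t, _ in results)
    wall = max(d for _, _, d in results)             # all requests start together
    avg_ttft = sum(t for t, _, _ in results) / len(results)
    print(f"concurrency={concurrency} avg_ttft={avg_ttft:.3f}s tok/s={total_tokens / wall:.1f}")

if __name__ == "__main__":
    for c in (1, 8, 32, 128):
        asyncio.run(run(c))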

Vultr Cloud GPU

NVIDIA HGX B200

Deploy NVIDIA B200 on Vultr Cloud GPU