
Qwen3 Nemotron 235B A22B GenRM

NVIDIA
Qwen3-Nemotron-235B-A22B-GenRM is a large-scale Generative Reward Model (GenRM) built on the Qwen3-235B-A22B foundation and fine-tuned to evaluate assistant responses for helpfulness and quality. Based on a 235B-parameter Mixture-of-Experts (MoE) transformer, it features 94 layers, 64 attention heads, 128 experts (8 experts per token), and a 4,096 hidden size. The model processes up to 131K tokens and outputs structured helpfulness and ranking scores for candidate responses. It is designed for reinforcement learning from human feedback (RLHF), large-scale preference modeling, and advanced AI alignment workflows.
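Since the model judges candidate responses, a typical client wraps a question and its candidates into a single judging prompt. The sketch below builds such a request payload for an OpenAI-compatible chat endpoint; the prompt layout and `build_score_request` helper are illustrative assumptions, not the model's documented input format.

```python
# Minimal sketch of preparing a GenRM judging request for an
# OpenAI-compatible endpoint. The prompt layout below is an
# assumption for illustration, not the documented schema.
import json

def build_score_request(question: str, responses: list) -> dict:
    """Assemble a chat-completions payload asking the reward model
    to rate each candidate response to `question`."""
    numbered = "\n\n".join(
        f"[Response {i + 1}]\n{r}" for i, r in enumerate(responses)
    )
    return {
        "model": "nvidia/Qwen3-Nemotron-235B-A22B-GenRM",
        "messages": [
            {"role": "user",
             "content": f"Question:\n{question}\n\n{numbered}"},
        ],
        "temperature": 0.0,  # deterministic judging
    }

payload = build_score_request(
    "What is 2 + 2?",
    ["The answer is 4.", "It equals 22."],
)
print(json.dumps(payload, indent=2))
```

The payload can then be POSTed to the server's `/v1/chat/completions` route once it is running.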
Type: MoE GenRM
Capabilities: Text Generation, Instruction Following, Reasoning, Mathematical Reasoning, and 5 more
License: nvidia-open-model-license

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE
docker run --gpus all \
 --shm-size 128g \
 -p 8000:8000 \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 -e HF_TOKEN='YOUR_HF_TOKEN' \
 --ipc=host \
 lmsysorg/sglang:v0.5.8-cu130 \
 python3 -m sglang.launch_server \
 --model-path nvidia/Qwen3-Nemotron-235B-A22B-GenRM \
 --host 0.0.0.0 \
 --port 8000 \
 --max-prefill-tokens 65536 \
 --max-running-requests 1024 \
 --tp 8 \
 --mem-fraction-static 0.95 \
 --trust-remote-code
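Once the server is up, the model's reply must be turned back into usable scores. The helper below extracts helpfulness and ranking values from a judge reply; the `key: value` output format it parses is a hypothetical example, since the model card above does not specify the exact score schema.

```python
# Sketch of extracting structured scores from a GenRM reply.
# The 'Helpfulness: N' / 'Ranking: N' text format parsed here is
# an assumption for illustration, not the documented output schema.
import re

def parse_scores(reply: str) -> dict:
    """Pull numeric 'helpfulness' and 'ranking' fields out of a
    judge reply string (hypothetical format)."""
    scores = {}
    for key in ("helpfulness", "ranking"):
        m = re.search(rf"{key}\s*[:=]\s*(-?\d+(?:\.\d+)?)", reply, re.I)
        if m:
            scores[key] = float(m.group(1))
    return scores

print(parse_scores("Helpfulness: 4\nRanking: 1"))
# → {'helpfulness': 4.0, 'ranking': 1.0}
```

In an RLHF or preference-modeling pipeline, these parsed scores would feed directly into pairwise comparisons or reward signals.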

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.
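The reported metrics relate in a simple way: given the time a request was sent, when its first output token arrived, and when it finished, TTFT, inter-token latency (ITL), and per-request throughput follow directly. A small sketch, with illustrative field names:

```python
# How the benchmark metrics relate: TTFT, average inter-token
# latency (ITL), and per-request throughput derived from request
# timestamps. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float          # request sent (seconds)
    first_token: float    # first output token received
    end: float            # last output token received
    output_tokens: int

def ttft(t: RequestTiming) -> float:
    """Time to first token."""
    return t.first_token - t.start

def itl(t: RequestTiming) -> float:
    """Average inter-token latency over the decode phase."""
    return (t.end - t.first_token) / max(t.output_tokens - 1, 1)

def throughput(t: RequestTiming) -> float:
    """Output tokens per second for this request."""
    return t.output_tokens / (t.end - t.start)

t = RequestTiming(start=0.0, first_token=0.5, end=4.5, output_tokens=401)
print(ttft(t), itl(t), throughput(t))
# → 0.5 s TTFT, 0.01 s ITL, ~89.1 tok/s
```

Under load, higher concurrency generally raises both TTFT and ITL per request while increasing aggregate tokens/sec, which is the trade-off the charts below capture.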

Benchmark charts: ITL vs Concurrency, Time to First Token, Throughput Scaling, and Total Tokens/sec vs Avg TTFT.

Deploy NVIDIA HGX B200 on Vultr Cloud GPU