
Qwen3 Nemotron 235B A22B GenRM

NVIDIA
Qwen3-Nemotron-235B-A22B-GenRM is a large-scale Generative Reward Model (GenRM) built on the Qwen3-235B-A22B foundation and fine-tuned to evaluate assistant responses for helpfulness and quality. Based on a 235B-parameter Mixture-of-Experts (MoE) transformer, it features 94 layers, 64 attention heads, 128 experts (8 experts per token), and a 4,096 hidden size. The model processes up to 131K tokens and outputs structured helpfulness and ranking scores for candidate responses. It is designed for reinforcement learning from human feedback (RLHF), large-scale preference modeling, and advanced AI alignment workflows.
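Since the model judges candidate responses, a typical client wraps a question and its candidates into a single judging prompt. The sketch below builds such a request payload for an OpenAI-compatible chat endpoint; the prompt layout and `build_score_request` helper are illustrative assumptions, not the model's documented input format.

```python
# Minimal sketch of preparing a GenRM judging request for an
# OpenAI-compatible endpoint. The prompt layout below is an
# assumption for illustration, not the documented schema.
import json

def build_score_request(question: str, responses: list) -> dict:
    """Assemble a chat-completions payload asking the reward model
    to rate each candidate response to `question`."""
    numbered = "\n\n".join(
        f"[Response {i + 1}]\n{r}" for i, r in enumerate(responses)
    )
    return {
        "model": "nvidia/Qwen3-Nemotron-235B-A22B-GenRM",
        "messages": [
            {"role": "user",
             "content": f"Question:\n{question}\n\n{numbered}"},
        ],
        "temperature": 0.0,  # deterministic judging
    }

payload = build_score_request(
    "What is 2 + 2?",
    ["The answer is 4.", "It equals 22."],
)
print(json.dumps(payload, indent=2))
```

The payload can then be POSTed to the server's `/v1/chat/completions` route once it is running.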
Type: MoE GenRM
Capabilities: Text Generation, Instruction Following, Reasoning, Mathematical Reasoning, and 5 more
License: nvidia-open-model-license

Inference Instructions

Deploy and run this model on NVIDIA B200 GPUs using the command below. Copy the command to get started with inference.

CONSOLE
docker run --gpus all \
 --shm-size 128g \
 -p 8000:8000 \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 -e HF_TOKEN='YOUR_HF_TOKEN' \
 --ipc=host \
 lmsysorg/sglang:v0.5.8-cu130 \
 python3 -m sglang.launch_server \
 --model-path nvidia/Qwen3-Nemotron-235B-A22B-GenRM \
 --host 0.0.0.0 \
 --port 8000 \
 --max-prefill-tokens 65536 \
 --max-running-requests 1024 \
 --tp 8 \
 --mem-fraction-static 0.95 \
 --trust-remote-code
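Once the server is up, the model's reply must be turned back into usable scores. The helper below extracts helpfulness and ranking values from a judge reply; the `key: value` output format it parses is a hypothetical example, since the model card above does not specify the exact score schema.

```python
# Sketch of extracting structured scores from a GenRM reply.
# The 'Helpfulness: N' / 'Ranking: N' text format parsed here is
# an assumption for illustration, not the documented output schema.
import re

def parse_scores(reply: str) -> dict:
    """Pull numeric 'helpfulness' and 'ranking' fields out of a
    judge reply string (hypothetical format)."""
    scores = {}
    for key in ("helpfulness", "ranking"):
        m = re.search(rf"{key}\s*[:=]\s*(-?\d+(?:\.\d+)?)", reply, re.I)
        if m:
            scores[key] = float(m.group(1))
    return scores

print(parse_scores("Helpfulness: 4\nRanking: 1"))
# → {'helpfulness': 4.0, 'ranking': 1.0}
```

In an RLHF or preference-modeling pipeline, these parsed scores would feed directly into pairwise comparisons or reward signals.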

Model Benchmarks

Each model was tested with a fixed input size and total token volume while increasing concurrency to measure serving performance under load.
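The reported metrics relate in a simple way: given the time a request was sent, when its first output token arrived, and when it finished, TTFT, inter-token latency (ITL), and per-request throughput follow directly. A small sketch, with illustrative field names:

```python
# How the benchmark metrics relate: TTFT, average inter-token
# latency (ITL), and per-request throughput derived from request
# timestamps. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float          # request sent (seconds)
    first_token: float    # first output token received
    end: float            # last output token received
    output_tokens: int

def ttft(t: RequestTiming) -> float:
    """Time to first token."""
    return t.first_token - t.start

def itl(t: RequestTiming) -> float:
    """Average inter-token latency over the decode phase."""
    return (t.end - t.first_token) / max(t.output_tokens - 1, 1)

def throughput(t: RequestTiming) -> float:
    """Output tokens per second for this request."""
    return t.output_tokens / (t.end - t.start)

t = RequestTiming(start=0.0, first_token=0.5, end=4.5, output_tokens=401)
print(ttft(t), itl(t), throughput(t))
# → 0.5 s TTFT, 0.01 s ITL, ~89.1 tok/s
```

Under load, higher concurrency generally raises both TTFT and ITL per request while increasing aggregate tokens/sec, which is the trade-off the charts below capture.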

Benchmark charts: ITL vs Concurrency, Time to First Token, Throughput Scaling, and Total Tokens/sec vs Avg TTFT.

Deploy NVIDIA HGX B200 on Vultr Cloud GPU