The NVIDIA B200 is a Blackwell-architecture data center GPU designed for large-scale AI training and inference. NVIDIA HGX B200 servers provide 8 B200 GPUs per node, connected via NVSwitch 5.0.
| Specification | Value |
|---|---|
| Architecture | Blackwell (sm_100) |
| VRAM | ~179 GiB (183,359 MiB) HBM3e |
| Memory Bandwidth | 8.0 TB/s |
| FP8 Compute | ~4.5 PFLOPS |
| FP4 Compute | ~9.0 PFLOPS (NVFP4) |
| TDP | 1000W |
| Interconnect | NVLink 5.0 / NVSwitch 5.0 |
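One way to read the compute and bandwidth figures in the table together is the compute-to-bandwidth ratio: the arithmetic intensity (FLOPs per byte of HBM traffic) below which a kernel is memory-bandwidth-bound rather than compute-bound. A quick sketch from the table's numbers:

```python
# Balance point implied by the spec table: FLOPs the GPU can perform per byte
# of HBM traffic. Kernels with lower arithmetic intensity than this are
# memory-bandwidth-bound.

FP8_FLOPS = 4.5e15     # ~4.5 PFLOPS
FP4_FLOPS = 9.0e15     # ~9.0 PFLOPS (NVFP4)
HBM_BW = 8.0e12        # 8.0 TB/s, in bytes/s

print(f"FP8 balance point: {FP8_FLOPS / HBM_BW:.1f} FLOPs/byte")
print(f"FP4 balance point: {FP4_FLOPS / HBM_BW:.1f} FLOPs/byte")
```

Decode-phase GEMV-like operations sit far below these balance points, which is why the next section focuses on memory bandwidth.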
LLM inference is primarily memory-bandwidth-bound during the decode phase (generating tokens one at a time). The NVIDIA HGX B200's 8.0 TB/s memory bandwidth is among the highest available, directly translating to faster token generation for memory-bound workloads.
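A back-of-the-envelope sketch of what this bandwidth means for decode throughput. The 70B parameter count and FP8 weight precision below are illustrative assumptions, not a specific model from this cookbook:

```python
# Decode-phase roofline bound: at batch size 1, every generated token must
# stream the full set of model weights from HBM, so
#   token rate <= memory bandwidth / weight bytes.
# The 70B model size and FP8 (1 byte/param) weights are illustrative.

HBM_BW = 8.0e12        # B200 memory bandwidth, bytes/s
PARAMS = 70e9          # hypothetical 70B-parameter model
BYTES_PER_PARAM = 1    # FP8 weights

weight_bytes = PARAMS * BYTES_PER_PARAM
single_gpu_bound = HBM_BW / weight_bytes      # tok/s, batch=1, one GPU

# With tensor parallelism over 8 GPUs, each GPU streams 1/8 of the weights,
# so aggregate bandwidth raises the bound (ignoring NVLink overhead).
tp8_bound = single_gpu_bound * 8

print(f"batch=1 bound, 1 GPU: {single_gpu_bound:.0f} tok/s")
print(f"batch=1 bound, TP=8:  {tp8_bound:.0f} tok/s")
```

Real deployments batch many requests, so achieved per-request rates are lower and aggregate rates higher; the point is that the ceiling scales directly with memory bandwidth.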
Key NVIDIA HGX B200 advantages for inference:

- 8.0 TB/s of HBM3e bandwidth per GPU, which bounds decode-phase token generation
- ~9.0 PFLOPS of FP4 (NVFP4) compute for quantized models and compute-bound prefill
- ~179 GiB of VRAM per GPU (~1.4 TiB per node), keeping large models and KV caches resident
- NVLink 5.0 / NVSwitch 5.0 all-to-all connectivity for tensor parallelism and KV cache transfer
The NVIDIA HGX B200 uses an 8-GPU configuration with all-to-all NVSwitch connectivity:
```
# Verify topology
$ nvidia-smi topo -m
```
All GPU pairs connect through NVSwitch 5.0, providing 1.8 TB/s bidirectional bandwidth between any two GPUs in the node. This is critical for tensor parallelism (weight sharding across GPUs) and for NVIDIA Dynamo's disaggregated serving (KV cache transfer between prefill and decode pools).
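A rough estimate of what that interconnect bandwidth means for disaggregated serving: the cost of shipping one request's KV cache from a prefill GPU to a decode GPU. The model shape below (80 layers, GQA with 8 KV heads, head dimension 128) is an illustrative assumption:

```python
# Rough KV-cache transfer cost for prefill -> decode handoff over NVSwitch.
# Model dimensions are illustrative (a Llama-70B-like layout with GQA).

NVLINK_BW = 1.8e12     # bytes/s; this is the bidirectional figure, so a
                       # one-way copy may sustain roughly half in practice
LAYERS = 80
KV_HEADS = 8           # grouped-query attention
HEAD_DIM = 128
DTYPE_BYTES = 2        # FP16/BF16 KV cache
CONTEXT = 8192         # tokens of prompt context

# K and V tensors per layer
kv_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * CONTEXT
ms = kv_bytes / NVLINK_BW * 1e3

print(f"KV cache: {kv_bytes / 1e9:.2f} GB, transfer: {ms:.2f} ms")
```

Even multi-gigabyte KV caches move in a few milliseconds within the node, which is what makes prefill/decode disaggregation practical.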
Verified topology (abbreviated):
```
       GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0    X    NV18  NV18  NV18  NV18  NV18  NV18  NV18
GPU1   NV18   X    NV18  NV18  NV18  NV18  NV18  NV18
GPU2   NV18  NV18   X    NV18  NV18  NV18  NV18  NV18
GPU3   NV18  NV18  NV18   X    NV18  NV18  NV18  NV18
GPU4   NV18  NV18  NV18  NV18   X    NV18  NV18  NV18
GPU5   NV18  NV18  NV18  NV18  NV18   X    NV18  NV18
GPU6   NV18  NV18  NV18  NV18  NV18  NV18   X    NV18
GPU7   NV18  NV18  NV18  NV18  NV18  NV18  NV18   X
```

Every GPU pair shows NV18: 18 bonded NVLinks via NVSwitch 5.0. GPUs 0-3 are on NUMA node 0 (CPUs 0-63, 128-191) and GPUs 4-7 are on NUMA node 1 (CPUs 64-127, 192-255). The node has 14x Mellanox ConnectX NICs (mlx5_0 through mlx5_13) for network connectivity.
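The NUMA split matters for host-side work (tokenization, sampling, data loading): pinning a GPU's worker process to CPUs on the same NUMA node avoids cross-socket memory traffic. A minimal sketch encoding this node's reported topology; the CPU ranges are specific to this machine, so other systems should derive them from `nvidia-smi topo -m` and `lscpu`:

```python
# Map each GPU index to the CPU set of its local NUMA node, per the topology
# reported above. These ranges are specific to this node, not general.

def cpus_for_gpu(gpu: int) -> set[int]:
    if gpu in range(0, 4):    # NUMA node 0
        return set(range(0, 64)) | set(range(128, 192))
    if gpu in range(4, 8):    # NUMA node 1
        return set(range(64, 128)) | set(range(192, 256))
    raise ValueError(f"unknown GPU index {gpu}")

import os
# Pin the current process to GPU 5's local cores (Linux-only API):
# os.sched_setaffinity(0, cpus_for_gpu(5))
print(sorted(cpus_for_gpu(5))[:4], "...")   # [64, 65, 66, 67] ...
```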
These specifications were verified on the NVIDIA HGX B200 instance used for this cookbook:
```
$ nvidia-smi --query-gpu=index,name,memory.total,driver_version --format=csv
index, name, memory.total, driver_version
0, NVIDIA B200, 183359 MiB, 580.105.08
1, NVIDIA B200, 183359 MiB, 580.105.08
2, NVIDIA B200, 183359 MiB, 580.105.08
3, NVIDIA B200, 183359 MiB, 580.105.08
4, NVIDIA B200, 183359 MiB, 580.105.08
5, NVIDIA B200, 183359 MiB, 580.105.08
6, NVIDIA B200, 183359 MiB, 580.105.08
7, NVIDIA B200, 183359 MiB, 580.105.08
```
| Property | Value |
|---|---|
| GPUs | 8x NVIDIA B200 |
| VRAM per GPU | 183,359 MiB (~179 GiB) |
| Total VRAM | ~1,432 GiB (~1.54 TB) |
| Driver | 580.105.08 |
| CUDA | 13.0 |
| Persistence Mode | Enabled |
| Compute Capability | sm_100 (10.0) |
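MiB-to-GB conversions are an easy place to pick up a ~7% error, so it is worth checking the table's totals against the raw `nvidia-smi` figure:

```python
# Sanity-check the VRAM totals from the raw nvidia-smi value (reported in MiB).

per_gpu_mib = 183_359

per_gpu_gib = per_gpu_mib / 1024                 # binary gibibytes
per_gpu_gb = per_gpu_mib * 2**20 / 1e9           # decimal gigabytes
total_tb = per_gpu_mib * 8 * 2**20 / 1e12        # decimal terabytes, 8 GPUs

print(f"Per GPU: {per_gpu_gib:.1f} GiB ({per_gpu_gb:.1f} GB)")
print(f"Total:   {per_gpu_gib * 8:.0f} GiB ({total_tb:.2f} TB)")
```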
Each B200 GPU draws up to 1000W under load and approximately 140W at idle. During inference benchmarks, power consumption varies with load:
| State | Power per GPU | Total (8 GPUs) |
|---|---|---|
| Idle | ~140W | ~1,120W |
| Light load | ~200W | ~1,600W |
| Full load | ~700-1000W | ~5,600-8,000W |
Power efficiency (tok/s per watt) is a useful metric for production deployments but is not the primary focus of this cookbook.
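For reference, the metric is straightforward to compute. The throughput figure below is a hypothetical example, not a measured result from this cookbook:

```python
# Illustrative tokens-per-watt calculation; the throughput is hypothetical.

THROUGHPUT_TOK_S = 5000   # hypothetical aggregate node throughput
NODE_POWER_W = 7000       # within the full-load range above (~5,600-8,000W)

eff = THROUGHPUT_TOK_S / NODE_POWER_W
print(f"Efficiency: {eff:.3f} tok/s per watt")

# Idle power is a fixed cost: at ~1,120W idle, an underutilized node spends
# most of its power budget generating nothing, so batching to higher
# utilization improves tok/s/W even though per-GPU draw rises.
```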