Building an Nvidia H100 Superpod Exaflop Machine for Deep Learning Training and Inference

Karthik G Vaithianathan
May 15, 2023

In this blog, we'll see how to build an exaflop machine using Nvidia's H100 GPU, which is based on its latest Hopper architecture.

The Nvidia H100 is manufactured on a TSMC 4nm node, has an 814 sq mm die and a 700 W TDP, and delivers a peak performance of about 4 PFLOPS (int8/fp8, with sparsity). That works out to an energy efficiency of about 5.7 TFLOPS/watt (4 PFLOPS / 700 W).

256 H100s are used to build the 1 exaflop machine, which means the power drawn by the GPUs alone is approximately 700 W x 256 ≈ 180 kW. If we were to operate this machine for an entire day, it would consume about 180 kW x 24 h ≈ 4,300 kWh, and the electricity cost would be about $500 per day, or $15,000 per month (in Bangalore).
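The power and cost arithmetic above can be checked with a short sketch. The GPU count and TDP come from the text; the tariff of 0.116 USD/kWh is an assumption, back-solved from the roughly $500 per day figure, since the post does not state the rate it used. The function names are illustrative only.

```python
# Values taken from the text.
GPU_TDP_W = 700      # TDP of one H100, in watts
NUM_GPUS = 256       # GPUs in the 1 exaflop machine

# ASSUMPTION: tariff back-solved from the ~$500/day figure; not stated in the text.
TARIFF_USD_PER_KWH = 0.116


def gpu_power_kw(num_gpus=NUM_GPUS, tdp_w=GPU_TDP_W):
    """Total GPU power draw in kilowatts (a rate, not an amount of energy)."""
    return num_gpus * tdp_w / 1000


def energy_kwh(power_kw, hours):
    """Energy consumed when drawing power_kw continuously for the given hours."""
    return power_kw * hours


def electricity_cost_usd(power_kw, hours, tariff=TARIFF_USD_PER_KWH):
    """Electricity cost in USD for running at power_kw for the given hours."""
    return energy_kwh(power_kw, hours) * tariff


power = gpu_power_kw()
print(f"GPU power draw : {power:.1f} kW")
print(f"Energy per day : {energy_kwh(power, 24):.0f} kWh")
print(f"Cost per day   : ${electricity_cost_usd(power, 24):.0f}")
print(f"Cost per month : ${electricity_cost_usd(power, 24 * 30):.0f}")
```

This covers only the GPUs at full TDP; CPUs, switches, optics and cooling would add to the total.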

What would it cost to build one?
CDW's list price for an H100 is approximately $30,000. https://lnkd.in/gmXd3Mwr
The GPUs alone will therefore cost approximately 256 x $30,000 = $7.68 million.

H100 has a 5120-bit wide bus to HBM3 memory, giving 3 TB/sec DRAM bandwidth and 80 GB capacity.

Each H100 has 18 NVLink Gen4 ports, giving a total bandwidth of 900 GB/sec (18 x 50 GB/sec) on the NVLink4 interface.

Unlike the PCIe memory model, the new NVLink4-based fabric allows a flat memory view, i.e., any GPU can access the HBM3 memory of any other GPU in the fabric as if it were its own memory (this competes with CXL3).

DGX H100 SERVER (DGXS) is the server built using H100s, with the following components.
1. Layer 1 NVSwitch Gen3, 4 per server. Each NVSwitch is built from one NVSwitch chip with 64 ports. Each port is NVLink4 compatible at 50 GB/sec, for a total duplex bandwidth of 3.2 TB/sec.

2. Eight H100s are connected via the four NVSwitches, i.e., the 18 NVLinks (18 NVLink ports) from each H100 are split 5, 4, 4, 5 across the four NVSwitch Gen3 switches. The total per-GPU fanout bandwidth is 18 x 50 GB/sec = 900 GB/sec (bidirectional).

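3. Bisection bandwidth (a vertical cut that leaves two of the four switches on each side) is 40 + 32 = 72 NVLink ports x 50 GB/sec = 3.6 TB/sec. Here 40 = 5 links x 8 GPUs and 32 = 4 links x 8 GPUs are the GPU-facing ports of the two switches on one side of the cut.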

4. Each DGX H100 has a fanout of 18 NVLink Network OSFPs. Each OSFP is an 800 Gbps (per direction) optical connector carrying 4 NVLinks, i.e., 200 GB/sec bidirectional, for a total of 72 NVLink4 ports and 3.6 TB/sec on the interface. In other words, where a single H100 exposes 18 NVLink4 ports, the server exposes 18 NVLink4 OSFP cages. Note that 3.6 TB/sec equals half the bandwidth of each GPU (450 GB/sec) x 8: half of the GPU bandwidth is allocated to GPU-to-GPU communication inside the server, while the remaining half is allocated to server-to-server communication.

5. Each DGX H100 also has 800 GB/sec of aggregate full-duplex bandwidth to non-NVLink Network devices. Four OSFP cages are allocated for this.
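The server-level numbers in the list above all follow from three figures in the text: 50 GB/sec per NVLink4 port, the 5, 4, 4, 5 link split, and 72 external NVLink ports. A minimal sketch that recomputes them (the function and constant names are illustrative, not from Nvidia):

```python
NVLINK_GB_S = 50            # bidirectional bandwidth of one NVLink4 port (from the text)
GPUS_PER_SERVER = 8         # H100s per DGX H100 (from the text)
LINK_SPLIT = (5, 4, 4, 5)   # NVLinks from each GPU to NVSwitch 0..3 (from the text)
EXTERNAL_NVLINK_PORTS = 72  # NVLink ports leaving the server (from the text)


def gpu_fanout_gb_s():
    """Per-GPU NVLink bandwidth: all 18 links at 50 GB/sec each."""
    return sum(LINK_SPLIT) * NVLINK_GB_S


def gpu_facing_ports():
    """GPU-facing ports used on each of the four NVSwitches."""
    return [links * GPUS_PER_SERVER for links in LINK_SPLIT]


def bisection_gb_s():
    """Bandwidth of the GPU-facing ports on one side of the cut.

    The cut leaves switches 0 and 1 on one side and 2 and 3 on the other,
    following the 40 + 32 port count used in the text.
    """
    ports = gpu_facing_ports()
    return (ports[0] + ports[1]) * NVLINK_GB_S


def external_gb_s():
    """Bandwidth of the NVLink ports that leave the server."""
    return EXTERNAL_NVLINK_PORTS * NVLINK_GB_S


print("Per-GPU fanout      :", gpu_fanout_gb_s(), "GB/sec")
print("GPU-facing ports    :", gpu_facing_ports())
print("Bisection bandwidth :", bisection_gb_s(), "GB/sec")
print("External bandwidth  :", external_gb_s(), "GB/sec")
```

The external bandwidth comes out equal to half of the aggregate GPU bandwidth (8 x 900 GB/sec / 2), which matches the half-internal, half-external allocation described in item 4.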

DGX H100 SUPERPOD is built using DGX H100 servers as follows.
1. DGXS nodes, each with 3.6 TB/sec of interface bandwidth over 18 NVLink4 OSFPs (72 NVLink ports).

2. Layer 2 NVSwitch Gen3, each built from two NVSwitch chips, with 128 ports in 32 OSFP cages at 50 GB/sec per port, for a total bidirectional bandwidth of 6.4 TB/sec.

3. 32 DGXS nodes are connected to 18 Layer 2 NVSwitches in a fully connected topology using InfiniBand (IB) over optics (800G optical transceivers). Note that the PHY of the NVLink switch ports is compatible with both InfiniBand and Ethernet (400 Gbps). Each server connects four of its 72 ports to each NVSwitch (72 / 18 = 4), so each NVSwitch uses 32 servers x 4 ports per server = 128 ports.
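The port counts in the SuperPOD list above can be sanity-checked by building the wiring explicitly. This is a sketch under one assumption: each server spreads its 72 ports evenly over the 18 switches in consecutive groups of four. The port numbering scheme is hypothetical and chosen only for illustration.

```python
SERVERS = 32               # DGXS nodes in the SuperPOD (from the text)
PORTS_PER_SERVER = 72      # external NVLink ports per DGXS (from the text)
L2_SWITCHES = 18           # Layer 2 NVSwitches (from the text)
PORTS_PER_L2_SWITCH = 128  # ports per Layer 2 NVSwitch (from the text)


def build_wiring():
    """Return {switch_id: [(server_id, server_port), ...]}.

    ASSUMPTION: server ports 0-3 go to switch 0, ports 4-7 to switch 1, and
    so on. The real cabling order is not given in the text.
    """
    per_switch = PORTS_PER_SERVER // L2_SWITCHES  # 72 / 18 = 4 ports per switch
    wiring = {switch: [] for switch in range(L2_SWITCHES)}
    for server in range(SERVERS):
        for port in range(PORTS_PER_SERVER):
            wiring[port // per_switch].append((server, port))
    return wiring


wiring = build_wiring()
for switch, links in wiring.items():
    # Every Layer 2 switch should be exactly full: 32 servers x 4 ports = 128.
    assert len(links) == PORTS_PER_L2_SWITCH
print("Ports used on each Layer 2 switch:", len(wiring[0]))
print("Total links in the fabric        :", sum(len(v) for v in wiring.values()))
```

Because every server reaches every Layer 2 switch, any two servers are one switch hop apart at the second layer.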

PERFORMANCE ANALYSIS
DGX H100 SERVER (DGXS)
4 PFLOPS (FP8) per H100 x 8 units of H100s = 32 PFLOPS
Bisection Bandwidth = 3.6 TB/sec

DGX H100 SUPERPOD
32 PFLOPS per DGXS x 32 units of DGXS = 1024 PFLOPS ≈ 1 EXAFLOPS
Bisection Bandwidth = 57.6 TB/sec
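
The totals above can be reproduced with a few lines. The compute figures follow directly from the text. For the SuperPOD bisection bandwidth the text gives only the result, so the derivation here is an assumption: cut the 32 servers into two halves of 16 and count the 3.6 TB/sec that each server in one half can push into the NVLink Network.

```python
PFLOPS_PER_GPU = 4        # FP8 peak per H100 (from the text)
GPUS_PER_SERVER = 8       # H100s per DGXS (from the text)
SERVERS_PER_POD = 32      # DGXS nodes per SuperPOD (from the text)
SERVER_NVLINK_TB_S = 3.6  # external NVLink bandwidth per DGXS (from the text)


def server_pflops():
    """Peak FP8 compute of one DGX H100 server."""
    return PFLOPS_PER_GPU * GPUS_PER_SERVER


def pod_pflops():
    """Peak FP8 compute of the whole SuperPOD."""
    return server_pflops() * SERVERS_PER_POD


def pod_bisection_tb_s():
    """ASSUMPTION: bisection = external bandwidth of one half of the servers."""
    return (SERVERS_PER_POD // 2) * SERVER_NVLINK_TB_S


print("GPUs in the SuperPOD  :", GPUS_PER_SERVER * SERVERS_PER_POD)
print("DGXS peak compute     :", server_pflops(), "PFLOPS")
print("SuperPOD peak compute :", pod_pflops(), "PFLOPS")
print(f"SuperPOD bisection    : {pod_bisection_tb_s():.1f} TB/sec")
```

Under that assumption the sketch arrives at the same 57.6 TB/sec quoted above.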
