From Exascale, towards building Zettascale general purpose & AI Supercomputers

Karthik G Vaithianathan
8 min read · May 17, 2023


Recently, I saw an interesting slide (shown above) presented by AMD’s CTO Mark Papermaster, showing the need for a 40x improvement in the energy efficiency of computing by 2035. Even with that 40x gain, a zettascale-capable data center would still draw about 500 MW, roughly half the output of a nuclear plant.

The Analysis

As per the Green500 list published in November 2022 (https://www.top500.org/lists/green500/2022/11/), the largest supercomputer, Frontier (the first exascale machine), was built with AMD’s EPYC 64C 2 GHz CPUs and AMD Instinct MI250X GPUs over the Slingshot-11 interconnect, and is currently hosted at the Department of Energy’s (DOE) Oak Ridge National Laboratory, USA.

This supercomputer has 8.73 million CPU/GPU cores, delivers 1.1 exaflops, and consumes 21 megawatts of power, for an average energy efficiency of 52 GFLOPS/watt. The projected 2035 supercomputer delivering zettascale performance at 500 megawatts would need an energy efficiency of about 2140 GFLOPS/watt, a roughly 40x improvement over 13 years.
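As a sanity check, here is the arithmetic behind these numbers as a small Python snippet (the article’s 2140 GFLOPS/watt figure corresponds to a slightly-above-1-zettaflop target; the snippet assumes exactly 1 ZFLOP):

```python
# Back-of-the-envelope: sustained FLOPS divided by facility power,
# in GFLOPS/W, plus the implied improvement factor.

frontier_flops = 1.1e18        # 1.1 exaflops (FP64, Rmax)
frontier_power_w = 21e6        # 21 MW
zetta_flops = 1e21             # 1 zettaflop target for 2035
zetta_power_w = 500e6          # 500 MW budget

frontier_eff = frontier_flops / frontier_power_w / 1e9   # ~52 GFLOPS/W
zetta_eff = zetta_flops / zetta_power_w / 1e9            # ~2000 GFLOPS/W

print(f"Frontier:    {frontier_eff:.0f} GFLOPS/W")
print(f"Zettascale:  {zetta_eff:.0f} GFLOPS/W")
print(f"Improvement: {zetta_eff / frontier_eff:.0f}x")   # ~38-40x
```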

Note that the performance unit here is FP64 FLOPS, whereas AI accelerators are commonly rated in TOPS (tera operations per second, typically int8) or in low-precision floating-point formats such as FP8/FP16, not FP64.

The top-of-bin EPYC 7763 part comes in a 225–280 watt TDP and provides 3.58 teraflops of peak double-precision performance at a max boost frequency of 3.5 GHz, or over 7 teraflops in a dual-socket server (reference: https://www.hpcwire.com/2021/03/15/amd-launches-epyc-milan-with-19-skus-for-hpc-enterprise-and-hyperscale). That puts the EPYC CPU’s energy efficiency at about 12 GFLOPS/watt.

The AMD Instinct MI250X GPU’s double-precision performance is 47.9 TFLOPS at a TDP of 560 watts (reference: https://www.amd.com/en/products/server-accelerators/instinct-mi250x). The MI250X’s energy efficiency is therefore about 85 GFLOPS/watt, while its integer performance is 383 TOPS, or roughly 0.68 TOPS/watt.
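These per-part efficiencies follow directly from peak throughput over TDP; a small helper makes the arithmetic explicit:

```python
# FP64 efficiency of Frontier's two building blocks, computed from the
# peak-FLOPS and TDP numbers quoted above.

def gflops_per_watt(peak_tflops: float, tdp_watts: float) -> float:
    """Peak TFLOPS divided by TDP, expressed in GFLOPS per watt."""
    return peak_tflops * 1e3 / tdp_watts

print(gflops_per_watt(3.58, 280))   # EPYC 7763 (FP64): ~12.8
print(gflops_per_watt(47.9, 560))   # MI250X (FP64):    ~85.5
print(383 / 560)                    # MI250X (int8):    ~0.68 TOPS/W
```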

Since Dennard scaling ended around 2007 (clock frequency could no longer be increased), the industry has been searching for parallel workloads that can use the raw FLOPS made available by ever-increasing transistor density. Apart from graphics and video, which touch relatively few other industries, there were few such workloads beyond a handful of highly threaded, big-data ones. The new era of AI has come as a boon: the deep learning networks that now form the core of AI are naturally parallel (in two forms, data parallelism and model parallelism). Given that AI is expected to impact almost all industries and market verticals, pundits predict that from now on the supercomputing roadmap will largely be dominated by AI workloads.

The Data Center Design Space

The implementation design space for computing systems is generally defined by power, performance, and area (cost). In data centers, however, where both capex and opex are significant, an additional variable comes into play: total cost of ownership (TCO). The two major contributors to TCO are (a) recurring energy bills and (b) rack-space rent in a secure facility colocated with a high-bandwidth Internet POP. Although the energy bill is indirectly covered by the power variable, it is important to note that over 70% of it goes towards cooling, i.e., the air conditioning and cooling systems needed to dissipate heat. On the question of maximum computing density per square foot (compute/sqft), an interesting point is that the majority of older data centers are designed to power and cool at most 20 kW per server rack, which limits the vertical stacking of servers.
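To get a rough feel for these constraints, here is a short Python sketch (the electricity price is a hypothetical placeholder, and treating a Frontier-class 21 MW figure as pure IT load is purely for illustration):

```python
# Rough power-density and energy-bill arithmetic for the constraints above.

rack_limit_kw = 20                        # older facilities: per-rack budget
it_load_mw = 21                           # e.g., a Frontier-class machine
print(it_load_mw * 1000 / rack_limit_kw)  # ~1050 racks at 20 kW each

# If ~70% of the energy bill goes to cooling, the metered load is roughly
# the IT load divided by 0.3.
total_load_mw = it_load_mw / 0.3          # ~70 MW at the meter
price_per_kwh = 0.10                      # hypothetical $/kWh
annual_bill = total_load_mw * 1000 * 24 * 365 * price_per_kwh
print(f"${annual_bill / 1e6:.0f}M per year")  # ~$61M
```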

Technology Trends

Now, let’s take a step back and look at the technology trends along the three primary vectors: compute, memory, and interconnect. The graph below shows FLOPS, memory, and interconnect scaling over the last 25 years (source), and it is obvious that both memory and interconnect bandwidth scaling are lagging behind FLOPS growth.

While CMOS logic has followed Moore’s law, with a trend of roughly tripling logic FLOPS every two years, memory and interconnect bandwidths have not grown at the same rate. This is one of the main reasons for low application-level FLOPS/watt: most DNN accelerators are limited by memory, interconnect, or both. The following table (reference: Google’s OCS paper) shows the ultra-low utilization of Google TPUs (v3 and v4) for various DNN models on both inference and training workloads, supporting this argument.

From the workload perspective, if we analyze parallel implementations of DNN training algorithms, the graph below (reference: PipeDream) shows communication overhead as a percentage of total training time for different hardware configurations. Many models (AlexNet, VGG16, S2VT) have high communication overhead even on the relatively slow K80 GPUs, and two factors increase that overhead across all models: (i) more data-parallel workers, and (ii) greater GPU compute capacity.
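To see why both factors matter, here is a minimal, bandwidth-only cost model in Python. It assumes ring all-reduce for gradient exchange; the model size, per-step FLOPs, and link speed are rough, hypothetical values for a VGG16-like network, not PipeDream’s measured numbers:

```python
# Ring all-reduce moves ~2*(N-1)/N of the gradient bytes per worker per
# step, so communication time stays nearly flat as workers are added,
# while compute time per step shrinks as GPUs get faster.

def comm_fraction(model_gb, step_tflop, n_workers, gpu_tflops, link_gbps):
    """Fraction of a training step spent communicating (bandwidth only)."""
    compute_s = step_tflop / gpu_tflops
    comm_s = 2 * (n_workers - 1) / n_workers * model_gb * 8 / link_gbps
    return comm_s / (comm_s + compute_s)

# ~0.55 GB of gradients (VGG16-like), ~1.5 TFLOP per step, 10 Gbps links.
for gpu, tflops in [("K80", 4.1), ("V100", 15.7)]:
    for n in (4, 16):
        print(f"{gpu}, {n} workers: "
              f"{comm_fraction(0.55, 1.5, n, tflops, 10):.0%} communication")
```

Even in this toy model, the faster GPU spends a larger fraction of each step waiting on the network, which is exactly the trend in the PipeDream data.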

What’s happening on the memory front?

Over the past two decades, since the success of double-data-rate SDRAM and the failure of RDRAM (back in the old Pentium days), the bus architecture has remained largely static except for the width of the data bus. The chip-to-device bus continues to be a parallel bus, with the evolution happening in per-wire bandwidth. DDR has evolved into three forms: DDRx, LPDDRx, and HBMx. In the high-performance computing space in particular, HBM3 with its wide parallel bus has become popular. The per-wire signalling bandwidth of HBM3’s DDR PHY layer is 6.4 Gbps, and Rambus’s recent HBM3 implementation (figure below) has pushed this to 8.4 Gbps per pin (reference: HBM3 roadmap). With a 1024-bit-wide bus, memory throughput has already reached 1.05 TB/sec.
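That throughput figure falls straight out of the pin rate and bus width; a tiny sketch:

```python
# HBM stack bandwidth from the per-pin rates quoted above:
# bus width (bits) * per-pin data rate (Gbps) / 8 -> GB/s.

def hbm_bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    return bus_width_bits * pin_rate_gbps / 8

print(hbm_bandwidth_gbs(1024, 6.4))   # baseline HBM3:  819.2 GB/s
print(hbm_bandwidth_gbs(1024, 8.4))   # Rambus HBM3:   1075.2 GB/s (~1.05 TB/s)
```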

What’s happening on the interconnect front?

At the system architecture level, interconnect can be viewed at five scopes: (a) on-die, (b) die-to-die within a package, (c) chip-to-chip (board level), (d) inter-rack, and (e) inter-data center. While (a) and (b) are in general parallel buses (with sideband signaling and protocols for routing and switching), (c) and (d) are primarily serial interconnects, whose bandwidth is driven by SERDES performance. The trend in SERDES signaling rate is shown in the table below (reference: Towards 1.6 Tbps Ethernet). Today, both Nvidia’s NVLink4 and 400G Ethernet use 100 Gbps as the maximum signalling rate per pair of copper wires (PHY). An NVLink4 link is made of two lanes, each capable of 50 GT/sec with PAM-4 (2 bits per symbol), i.e., 100 Gbps per lane per direction, giving a total bidirectional bandwidth of 400 Gbps, or 50 GB/sec, per link.
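A quick sketch of that arithmetic (lanes × symbol rate × bits per symbol), which also recovers the H100’s 900 GB/sec aggregate quoted later in this article:

```python
# NVLink4 link bandwidth from first principles.

lanes = 2                    # lanes per NVLink4 link, per direction
symbol_rate_gbaud = 50       # 50 GT/s
bits_per_symbol = 2          # PAM-4

per_lane_gbps = symbol_rate_gbaud * bits_per_symbol       # 100 Gbps
per_direction_gbps = lanes * per_lane_gbps                # 200 Gbps
bidir_gb_per_s = 2 * per_direction_gbps / 8               # 50 GB/s per link
print(per_lane_gbps, per_direction_gbps, bidir_gb_per_s)

# An H100 exposes 18 such links: 18 * 50 GB/s = 900 GB/s aggregate.
print(18 * bidir_gb_per_s)                                # 900.0
```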

The future of interconnect lies in optics/photonics, where the design space includes the number of wavelengths used for WDM, the modulation technique, polarization, the number of wavelengths per waveguide, and the ability to convert signals between the electrical and optical domains at high switching rates. Academia and startups are exploring rates from 800 Gbps to multiple terabits per second per channel using multiple wavelengths.
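The aggregate arithmetic is simple: wavelengths times the per-wavelength rate. A tiny illustration with hypothetical design points (these are not shipping products):

```python
# Illustrative WDM arithmetic: aggregate bandwidth scales with the number
# of wavelengths times the per-wavelength modulation rate.

for n_lambda, gbps in [(8, 100), (16, 100), (16, 200)]:
    print(f"{n_lambda} wavelengths x {gbps} Gbps = "
          f"{n_lambda * gbps / 1000:.1f} Tbps")   # 0.8, 1.6, 3.2 Tbps
```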

A Brief Look at GPUs and Server-Grade ML SoCs that can break the Exascale Barrier

Nvidia Hopper GPU H100

Nvidia’s latest Hopper H100 GPU in a x2 configuration has a peak double-precision performance of 68 TFLOPS at a TDP of 800 W. The H100’s FP64 energy efficiency, like the MI250’s, is about 85 GFLOPS/watt. Integer performance is 4 peta-operations per second (POPS), for an integer energy efficiency of 5.71 TOPS/watt. DGX H100, the server version with eight H100s, consumes 10.2 kW, including its two Intel Sapphire Rapids Xeon CPUs. The die measures 814 sq mm at 4 nm, with 900 GB/sec of chip-to-chip bandwidth.

Google’s TPU v4

The latest Google TPU v4 has a peak performance of 275 TFLOPS (bf16) or 275 TOPS (int8) at an approximate TDP of 200 watts, resulting in 1.375 TOPS/watt. The die measures 600 sq mm at 7 nm, with 300 GB/sec of chip-to-chip bandwidth.

Facebook’s MTIA

Facebook’s MTIA has 64 processing elements (PEs) in an 8x8 grid and 128 MB of on-chip SRAM, is built on TSMC’s 7 nm process, and runs at 800 MHz, providing 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16 precision at a TDP of 25 W. By integer ops, power efficiency is 4 TOPS/watt; by the FP16 FLOP metric, it is roughly 2 TFLOPS/watt.

Graphcore's IPU MK2

Graphcore’s MK2 IPU integrates 1472 processor tiles with a record 896 MB of on-chip SRAM in a die area of 823 sq mm on TSMC’s 7 nm process node. Running at a 1.325 GHz clock, the chip consumes 300 W and can process a peak of 250 TFLOPS of 16-bit floating-point operations for machine learning. The table below (from a slide deck presented by Graphcore’s CTO in 2021) shows the MK2’s energy efficiency at 570 GFLOPS/watt.

Qualcomm’s AI100

Optimized for the edge market, the AI100 in its 16-core version delivers 400 TOPS at a 75 W TDP, running at 2.1 GHz on TSMC 7 nm, for an energy efficiency of 5.33 TOPS/watt.

Untether’s Boqueria

Boqueria is an in-memory-compute inference accelerator from Untether, capable of 2 peta-operations per second of INT8/FP8 using 1485 RISC-V cores and 238 MB of SRAM running at 1.35 GHz, giving 30 TOPS/watt on TSMC 7 nm.

Tesla’s Dojo

The Dojo D1 die runs at 2 GHz, has a total of 440 MB of SRAM across its cores, and delivers 376 teraflops at BF16 or CFP8 and 22 teraflops at FP32. Manufactured in a 7 nm process, it has 50 billion transistors and occupies an area of 645 sq mm at a TDP of 400 watts. That puts its energy efficiency at 940 GFLOPS/watt (0.94 TOPS/watt) at BF16/CFP8, or 55 GFLOPS/watt at FP32.

Ampere One A192

Ampere’s A192 packs 192 Arm-ISA cores at a 434-watt TDP. Ampere has not announced FLOPS figures; instead it prefers to compare in terms of the number of VMs per rack, and benchmarks AI performance (generative AI and DLRM) against AMD CPUs.

AWS Graviton3

Graviton3 has 64 ARMv8.5 cores on TSMC’s 5 nm process at a 100 W TDP; AWS rates its performance in ECUs (elastic compute units), at 4.4 ECU/watt. Not many details on TFLOPS/watt are publicly available.
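To pull the survey together, here is a small Python snippet that recomputes the low-precision efficiency of the chips above from the numbers quoted in this article. All values are vendor peak figures, not measured results; the H100 and Boqueria TDPs are back-derived from their quoted TOPS/watt, and chips without quoted TOPS figures (Ampere, Graviton3, Graphcore) are omitted:

```python
# (peak low-precision TOPS, TDP watts) -> TOPS/W, sorted best-first.

chips = {
    "Nvidia H100 x2 (int8)":   (4000, 700),   # 4 POPS, 5.71 TOPS/W
    "Google TPU v4 (int8)":    (275,  200),
    "Meta MTIA (int8)":        (102.4, 25),
    "Qualcomm AI100 (int8)":   (400,   75),
    "Untether Boqueria (fp8)": (2000, 66.7),  # 2 POPS at 30 TOPS/W
    "Tesla Dojo D1 (bf16)":    (376,  400),
}

for name, (tops, watts) in sorted(chips.items(),
                                  key=lambda kv: kv[1][0] / kv[1][1],
                                  reverse=True):
    print(f"{name:26s} {tops / watts:6.2f} TOPS/W")
```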

References:

https://www.top500.org/site/48553/

https://www.top500.org/lists/green500/2022/11/

https://www.top500.org/system/180047/

https://www.olcf.ornl.gov/frontier/

https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet.pdf
