How long does it take to train an LLM like GPT-3?
Teraflop/s is 10¹² floating-point operations per second, Petaflop/s is 10¹⁵, Exaflop/s is 10¹⁸, and Zettaflop/s is 10²¹ operations per second.
The table below, based on the original GPT-3 paper, shows the training compute requirements of several LLMs.
The configuration above uses 1,000 GPUs at 100 TFLOP/s each; at 100% utilization, it takes roughly 30 days to complete training on 300 billion tokens.
If we were to use a 1 Exaflop/s machine instead, then GPT-3 (175 billion parameters), with training compute on the order of 3.14×10²³ FLOPs, would take about 314,000 seconds of compute (assuming 100% utilization) to complete training on 300 billion tokens. 314,000 seconds is approximately 3.6 days.
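The arithmetic above can be sketched in a few lines. The sketch below assumes the common rule of thumb that training takes roughly 6 FLOPs per parameter per token, which reproduces the ~3.14×10²³ figure for 175B parameters and 300B tokens:

```python
# Back-of-the-envelope training-time estimate.
# Assumption: ~6 FLOPs per parameter per token, 100% utilization.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute: ~6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

def training_days(total_flops: float, machine_flops_per_s: float) -> float:
    """Wall-clock days at the given sustained throughput."""
    return total_flops / machine_flops_per_s / 86_400  # seconds per day

total = training_flops(175e9, 300e9)  # GPT-3 scale: ~3.15e23 FLOPs
print(f"total compute: {total:.2e} FLOPs")
print(f"1 Exaflop/s machine: {training_days(total, 1e18):.1f} days")
print(f"1,000 x 100 TFLOP/s GPUs: {training_days(total, 1000 * 100e12):.1f} days")
```

The Exaflop case lands at about 3.6 days, matching the estimate above; the 1,000-GPU cluster comes out at roughly a month.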
How much DDR memory is required?
GPT-3 has 175 billion parameters. At two bytes per parameter (FP16), the total memory required is 350 GB; at four bytes per parameter (FP32), it is 700 GB.
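The memory figures follow directly from parameters × bytes per parameter. A minimal calculator (weights only; optimizer states and activations during training need several times more):

```python
# Weight-storage estimate: parameters * bytes per parameter.
# Covers the datatypes discussed in this section.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory in GB (decimal) needed to store the weights alone."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gb(175e9, dtype):.1f} GB")
```

For 175B parameters this gives 700 GB (FP32), 350 GB (FP16), 175 GB (int8), and 87.5 GB (int4).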
Is it possible to reduce the model size without losing much accuracy?
Yes, generally there are two ways.
One approach is to quantize the weights to 8 bits (int8) or 4 bits (int4), which shrinks the storage to 175 GB or 87.5 GB, respectively.
The second approach is to reduce the number of model parameters, either by reducing the number of layers or the number of neurons per layer. This does, however, affect the accuracy, precision, and recall of the model.
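To make the first approach concrete, here is a minimal sketch of symmetric per-tensor int8 quantization. It is illustrative only, not the exact scheme any particular framework uses: each weight is mapped to an integer in [-127, 127] via a single scale factor, storing one byte instead of four per weight:

```python
# Minimal sketch of symmetric per-tensor int8 quantization (illustrative).
# Each float weight w becomes round(w / scale), an integer in [-127, 127].

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0  # largest weight maps to +/-127
    q = [round(w / scale) for w in weights]       # 1 byte per weight instead of 4
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.053, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)                                               # integer codes
print([round(w - r, 3) for w, r in zip(weights, restored)])  # small errors
```

The reconstruction error is bounded by half the scale per weight, which is why int8 quantization typically costs little accuracy; int4 halves the storage again but with a coarser grid and larger error.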