Background
- The GPT-3 paper discusses the need for a large number (thousands to hundreds of thousands) of ground-truth samples to fine-tune a language model for a specific task, in contrast to humans, who can learn from just a few examples. Truth labels are expensive, labor-intensive, and time-consuming to acquire. GPT-3 has demonstrated impressive capability on many NLP tasks (translation, QA, reasoning / textual entailment, domain adaptation) without extensive fine-tuning, using only few-shot prompting (a minimal prompt sketch follows this list).
- Megatron-LM (March 2020) is a predecessor of GPT-3 with 8.3B parameters, trained on 512 GPUs.
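To make "few-shot" concrete, here is a minimal sketch of a few-shot prompt: a handful of demonstrations are placed directly in the prompt and the model is expected to complete the last line, with no gradient updates. The English-to-French pairs echo the translation example in the GPT-3 paper, but the exact strings and formatting here are illustrative, and no model call is made.

```python
# Minimal sketch of few-shot prompting: demonstrations go in the prompt itself,
# so no fine-tuning on thousands of labeled examples is needed.
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("peppermint", "menthe poivrée"),
]

prompt = "Translate English to French.\n\n"
for en, fr in examples:
    prompt += f"{en} => {fr}\n"
prompt += "plush giraffe => "   # the model is expected to continue with the translation

print(prompt)
```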
Training Parallelism in Large Language Models
- intra-layer (tensor) model parallelism: the weight matrices within each transformer layer are split across GPUs (see the MLP sketch after this list)
- careful attention to the placement of layer normalization in BERT-like models is important for stable training as models scale up (the two orderings are contrasted below)
- pipeline model parallelism: consecutive groups of layers are assigned to different devices and micro-batches are streamed through the stages (sketched below)
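To make the intra-layer idea concrete, here is a minimal single-process sketch of the Megatron-style split of an MLP block: the first weight matrix is partitioned by columns, the second by rows, each shard computes independently, and a plain sum stands in for the all-reduce a real multi-GPU implementation would perform. Dimensions, shard count, and variable names are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Single-process sketch of intra-layer (tensor) model parallelism for an MLP block,
# with 2 "GPUs" simulated as weight shards.
rng = np.random.default_rng(0)
d_model, d_ff, n_shards = 8, 32, 2

x = rng.standard_normal((4, d_model))        # a batch of token activations
A = rng.standard_normal((d_model, d_ff))     # first MLP weight (split by columns)
B = rng.standard_normal((d_ff, d_model))     # second MLP weight (split by rows)

def gelu(z):
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

# Reference: unsharded MLP forward pass.
reference = gelu(x @ A) @ B

# Column-parallel split of A and matching row-parallel split of B: each shard
# computes a partial output independently; one sum (the "all-reduce") combines them.
A_shards = np.split(A, n_shards, axis=1)
B_shards = np.split(B, n_shards, axis=0)
partials = [gelu(x @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
sharded = np.sum(partials, axis=0)

assert np.allclose(reference, sharded)
print("sharded MLP matches the unsharded reference")
```

The column-then-row split is what lets the element-wise GeLU run on each shard without any communication in the middle of the block.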
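On the layer-norm point, the sketch below contrasts the original post-LN ordering (normalize after the residual add, as in the original BERT block) with a pre-LN ordering (normalize before each sublayer), which is in the spirit of the rearrangement Megatron-LM reports as necessary for scaling BERT-like models. The module sizes and hyperparameters are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original BERT ordering: layer norm applied after the residual add."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])   # LN after the residual add
        x = self.norm2(x + self.mlp(x))
        return x

class PreLNBlock(nn.Module):
    """Rearranged ordering: layer norm applied before each sublayer."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)                           # LN before the sublayer
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(2, 16, 64)                          # (batch, sequence, d_model)
print(PostLNBlock()(x).shape, PreLNBlock()(x).shape)
```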
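And for pipeline model parallelism, a minimal single-process simulation: the layer stack is cut into two stages, the batch is split into micro-batches, and at each tick the two stages work on different micro-batches, so they overlap once the pipeline fills. Stage count, dimensions, and the simple fill schedule are illustrative assumptions, not any particular framework's scheduler.

```python
import numpy as np

# Sketch of pipeline model parallelism: consecutive layers form stages,
# micro-batches are streamed through, and stages overlap in "time".
rng = np.random.default_rng(0)
d, n_layers, n_stages = 16, 4, 2
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

# Assign consecutive layers to stages: layers 0-1 -> stage 0, layers 2-3 -> stage 1.
stages = [weights[:2], weights[2:]]

def run_stage(stage_weights, x):
    for W in stage_weights:
        x = np.tanh(x @ W)
    return x

# Split a global batch into micro-batches and step a simple schedule:
# at tick t, stage 0 starts micro-batch t while stage 1 finishes micro-batch t-1.
batch = rng.standard_normal((8, d))
micro_batches = np.split(batch, 4)
in_flight = {}                       # micro-batch index -> activation awaiting stage 1
outputs = [None] * len(micro_batches)

for t in range(len(micro_batches) + n_stages - 1):
    mb = t - 1
    if mb in in_flight:
        outputs[mb] = run_stage(stages[1], in_flight.pop(mb))
    if t < len(micro_batches):
        in_flight[t] = run_stage(stages[0], micro_batches[t])

# The pipelined result matches running the whole stack sequentially.
reference = run_stage(weights, batch)
assert np.allclose(np.concatenate(outputs), reference)
print("pipelined output matches the sequential reference")
```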