Background
- The GPT-3 paper discusses the need for a large number (thousands to hundreds of thousands) of ground-truth samples to fine-tune a language model for a specific task, in contrast to humans, who can learn from just a few examples. Truth labels are expensive, labor-intensive, and time-consuming to acquire. GPT-3 has demonstrated impressive capability on many NLP tasks (translation, QA, reasoning / textual entailment, domain adaptation) without extensive fine-tuning, using only few-shot prompting (a minimal prompt sketch follows this list).
- Megatron-LM (March 2020) is a predecessor of GPT-3 with 8.3B parameters, trained on 512 GPUs.
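To make "few-shot" concrete, here is a minimal sketch of a few-shot prompt: a handful of demonstrations are placed directly in the prompt and the model is expected to complete the last line, with no gradient updates. The English-to-French pairs echo the translation example in the GPT-3 paper, but the exact strings and formatting here are illustrative, and no model call is made.

```python
# Minimal sketch of few-shot prompting: demonstrations go in the prompt itself,
# so no fine-tuning on thousands of labeled examples is needed.
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("peppermint", "menthe poivrée"),
]

prompt = "Translate English to French.\n\n"
for en, fr in examples:
    prompt += f"{en} => {fr}\n"
prompt += "plush giraffe => "   # the model is expected to continue with the translation

print(prompt)
```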
Training Parallelism in Large Language Models
- intra-layer (tensor) model parallelism: the weight matrices within each transformer layer are split across GPUs (see the MLP sketch after this list)
- careful attention to the placement of layer normalization in BERT-like models is important for stable training as models scale up (the two orderings are contrasted below)
- pipeline model parallelism: consecutive groups of layers are assigned to different devices and micro-batches are streamed through the stages (sketched below)
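To make the intra-layer idea concrete, here is a minimal single-process sketch of the Megatron-style split of an MLP block: the first weight matrix is partitioned by columns, the second by rows, each shard computes independently, and a plain sum stands in for the all-reduce a real multi-GPU implementation would perform. Dimensions, shard count, and variable names are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Single-process sketch of intra-layer (tensor) model parallelism for an MLP block,
# with 2 "GPUs" simulated as weight shards.
rng = np.random.default_rng(0)
d_model, d_ff, n_shards = 8, 32, 2

x = rng.standard_normal((4, d_model))        # a batch of token activations
A = rng.standard_normal((d_model, d_ff))     # first MLP weight (split by columns)
B = rng.standard_normal((d_ff, d_model))     # second MLP weight (split by rows)

def gelu(z):
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

# Reference: unsharded MLP forward pass.
reference = gelu(x @ A) @ B

# Column-parallel split of A and matching row-parallel split of B: each shard
# computes a partial output independently; one sum (the "all-reduce") combines them.
A_shards = np.split(A, n_shards, axis=1)
B_shards = np.split(B, n_shards, axis=0)
partials = [gelu(x @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
sharded = np.sum(partials, axis=0)

assert np.allclose(reference, sharded)
print("sharded MLP matches the unsharded reference")
```

The column-then-row split is what lets the element-wise GeLU run on each shard without any communication in the middle of the block.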
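On the layer-norm point, the sketch below contrasts the original post-LN ordering (normalize after the residual add, as in the original BERT block) with a pre-LN ordering (normalize before each sublayer), which is in the spirit of the rearrangement Megatron-LM reports as necessary for scaling BERT-like models. The module sizes and hyperparameters are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original BERT ordering: layer norm applied after the residual add."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])   # LN after the residual add
        x = self.norm2(x + self.mlp(x))
        return x

class PreLNBlock(nn.Module):
    """Rearranged ordering: layer norm applied before each sublayer."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)                           # LN before the sublayer
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(2, 16, 64)                          # (batch, sequence, d_model)
print(PostLNBlock()(x).shape, PreLNBlock()(x).shape)
```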
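And for pipeline model parallelism, a minimal single-process simulation: the layer stack is cut into two stages, the batch is split into micro-batches, and at each tick the two stages work on different micro-batches, so they overlap once the pipeline fills. Stage count, dimensions, and the simple fill schedule are illustrative assumptions, not any particular framework's scheduler.

```python
import numpy as np

# Sketch of pipeline model parallelism: consecutive layers form stages,
# micro-batches are streamed through, and stages overlap in "time".
rng = np.random.default_rng(0)
d, n_layers, n_stages = 16, 4, 2
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

# Assign consecutive layers to stages: layers 0-1 -> stage 0, layers 2-3 -> stage 1.
stages = [weights[:2], weights[2:]]

def run_stage(stage_weights, x):
    for W in stage_weights:
        x = np.tanh(x @ W)
    return x

# Split a global batch into micro-batches and step a simple schedule:
# at tick t, stage 0 starts micro-batch t while stage 1 finishes micro-batch t-1.
batch = rng.standard_normal((8, d))
micro_batches = np.split(batch, 4)
in_flight = {}                       # micro-batch index -> activation awaiting stage 1
outputs = [None] * len(micro_batches)

for t in range(len(micro_batches) + n_stages - 1):
    mb = t - 1
    if mb in in_flight:
        outputs[mb] = run_stage(stages[1], in_flight.pop(mb))
    if t < len(micro_batches):
        in_flight[t] = run_stage(stages[0], micro_batches[t])

# The pipelined result matches running the whole stack sequentially.
reference = run_stage(weights, batch)
assert np.allclose(np.concatenate(outputs), reference)
print("pipelined output matches the sequential reference")
```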