BERT and RoBERTa
BERT Recap

Overview

BERT (Bidirectional Encoder Representations from Transformers) uses a "masked language model" objective: some tokens in the input are randomly masked, and the model is trained to predict the original vocabulary id of each masked token. BERT shows that "pre-trained representations reduce the need for many heavily-engineered task-specific architectures".

BERT Specifics

The BERT framework has two steps: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks....
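As a rough illustration of the masked-language-model idea above, here is a minimal sketch of how input tokens can be corrupted and paired with prediction targets. The `mask_tokens` helper, the toy whitespace tokenization, and the fixed 15% masking rate are illustrative assumptions; the actual BERT procedure works on subword ids and also sometimes substitutes random tokens or leaves the original token in place.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK].

    Returns the corrupted sequence plus the labels the model must
    predict (the original token at each masked position, None elsewhere).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            labels.append(tok)       # model is trained to recover this token
        else:
            corrupted.append(tok)
            labels.append(None)      # position ignored by the MLM loss
    return corrupted, labels

# Toy usage: real BERT operates on WordPiece ids, not whitespace tokens.
tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

Because the model sees the full (unmasked) left and right context around each [MASK], the learned representations are bidirectional, which is the core difference from left-to-right language-model pre-training.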