Authors: Thibault Sellam, Dipanjan Das, Ankur P. Parikh
Paper reference: https://aclanthology.org/2020.acl-main.704.pdf

Contribution

This paper proposes BLEURT, a learned evaluation metric based on BERT that models human judgments. It develops a novel pre-training scheme that uses millions of synthetic examples to improve the model's generalization and robustness.

BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) provides state-of-the-art performance on the WMT Metrics Shared Task (2017-2019). The pre-training scheme (1) yields superior results even when training data is scarce; and (2) makes BLEURT significantly more robust to quality drifts (out-of-distribution data) and able to adapt quickly to new tasks.

Details

WMT Metrics Shared Task is an annual benchmark in which translation metrics are compared on their ability to imitate human assessments.

Pre-training on Synthetic Data

In general, the paper generates a large number of synthetic reference-candidate pairs $(z, \tilde{z})$ and trains BERT on several supervision signals with a multi-task loss. BLEURT models are trained in three steps: (1) regular BERT pre-training; (2) a second phase of pre-training on synthetic data; and (3) fine-tuning on task-specific human ratings.
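The final fine-tuning step (3) can be sketched as a simple regression head on top of the encoder. The sketch below is a toy illustration, not the paper's implementation: `encode` is a hypothetical stand-in for BERT's [CLS] representation, and the linear layer plus squared-error objective mirrors the general recipe of fine-tuning a pooled representation against human ratings.

```python
import numpy as np

HIDDEN = 8  # toy hidden size; real BERT uses 768 or 1024

def encode(reference: str, candidate: str) -> np.ndarray:
    """Hypothetical encoder: stands in for BERT's [CLS] vector
    computed over the concatenated (reference, candidate) pair."""
    h = abs(hash((reference, candidate))) % (2**32)
    return np.random.default_rng(h).standard_normal(HIDDEN)

# Linear scoring head: y_hat = W @ v_cls + b
rng = np.random.default_rng(0)
W = rng.standard_normal(HIDDEN) * 0.1
b = 0.0

def bleurt_score(reference: str, candidate: str) -> float:
    """Predicted quality rating for one reference-candidate pair."""
    return float(W @ encode(reference, candidate) + b)

def squared_error(pairs, ratings) -> float:
    """Fine-tuning objective: mean squared error vs. human ratings."""
    preds = np.array([bleurt_score(r, c) for r, c in pairs])
    return float(np.mean((preds - np.array(ratings)) ** 2))
```

In practice the encoder weights are updated jointly with the head during fine-tuning; the sketch freezes them only to keep the example short.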

Generating Sentence Pairs

This paper generates synthetic sentence pairs $(z, \tilde{z})$ by randomly perturbing 1.8 million segments $z$ from Wikipedia with three techniques: mask-filling with BERT, backtranslation, and randomly dropping out words.
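Of the three techniques, word dropout is the simplest to illustrate. The sketch below shows the idea under simplifying assumptions (a fixed per-word drop probability; the paper instead samples how many words to drop). Mask-filling and backtranslation would require a BERT model and a translation system, so they are omitted here.

```python
import random

def word_dropout(segment: str, drop_prob: float = 0.15, seed: int = 0) -> str:
    """Produce a perturbed candidate z~ from a segment z by randomly
    dropping words. Simplified sketch: each word is independently
    dropped with probability drop_prob."""
    rng = random.Random(seed)
    words = segment.split()
    kept = [w for w in words if rng.random() > drop_prob]
    # Fall back to the original segment if everything was dropped.
    return " ".join(kept) if kept else segment

z = "the quick brown fox jumps over the lazy dog"
z_tilde = word_dropout(z, drop_prob=0.3)
```

Each perturbed pair $(z, \tilde{z})$ then gets labeled with the pre-training signals described below.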

Pre-training Signals

For each sentence pair, the model calculates a weighted loss from a set of pre-training signals, including:
(1) Automatic Metrics: BLEU, ROUGE, BERTscore.
(2) Backtranslation Likelihood.
(3) Textual Entailment.
(4) Backtranslation Flag.
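The weighted combination of these signals can be sketched as follows. This is a toy illustration, not the paper's exact objective: each synthetic pair carries a target value per signal, and per-signal losses are combined with weights $\gamma_k$. For simplicity every task is treated as regression here, whereas categorical signals (e.g. the backtranslation flag and entailment labels) would use a classification loss in the paper.

```python
def multitask_loss(predictions: dict, targets: dict, gammas: dict) -> float:
    """Weighted multi-task pre-training loss over signals k:
    sum_k gamma_k * loss_k. Toy version: squared error per signal."""
    total = 0.0
    for k, gamma in gammas.items():
        total += gamma * (predictions[k] - targets[k]) ** 2
    return total

# Example with hypothetical signal names and weights:
preds   = {"bleu": 0.6, "rouge": 0.5, "bt_likelihood": 0.4}
targets = {"bleu": 0.7, "rouge": 0.5, "bt_likelihood": 0.2}
gammas  = {"bleu": 1.0, "rouge": 1.0, "bt_likelihood": 0.5}
loss = multitask_loss(preds, targets, gammas)
```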