Authors: Alexander R. Fabbri, Simeng Han, Haoyuan Li, Haoran Li, Marjan Ghazvininejad, Shafiq Joty, Dragomir Radev, Yashar Mehdad
Paper reference: https://arxiv.org/pdf/2010.12836.pdf

Main Contributions

This work uses BART as the base pretrained model and empirically shows that the proposed method, WikiTransfer, which creates auxiliary fine-tuning data from a generic corpus (Wikipedia) by encoding characteristics of the target summarization dataset, improves zero-shot and few-shot summarization. The method improves further in the few-shot setting when combined with the proposed data augmentation and consistency-regularization strategies.

Details

WikiTransfer Fine-tuning

The characteristics of the target summarization dataset include the average length of the input documents, the average summary length, and the level of abstraction of the desired summaries.

After the auxiliary data for a target domain is created by the WikiTransfer method, the pre-trained model (BART) is fine-tuned on this dataset-specific WikiTransfer data before being transferred to the target domain. Fine-tuning on this data lets the model learn the characteristics of the target dataset, which improves zero-shot and few-shot transfer.
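As a concrete illustration, the sketch below shows one way such dataset-specific auxiliary examples could be built from a sentence-split Wikipedia article: take the first sentences as a pseudo-summary, the remainder as the pseudo-document, and keep only pairs whose lengths and extractiveness roughly match the target dataset's statistics. The function names, the unigram-overlap proxy for abstractiveness, and the sentence-based length units are assumptions made for illustration, not the authors' released pipeline.

```python
# Illustrative sketch (not the authors' code): building WikiTransfer-style
# auxiliary (document, summary) pairs from a sentence-split Wikipedia article.

def extractive_overlap(summary_sents, document_sents):
    """Crude extractiveness proxy: fraction of summary unigrams found in the document."""
    doc_tokens = set(" ".join(document_sents).lower().split())
    summ_tokens = " ".join(summary_sents).lower().split()
    return sum(tok in doc_tokens for tok in summ_tokens) / max(len(summ_tokens), 1)

def make_pseudo_example(article_sents, summary_len, doc_len, target_overlap, tol=0.1):
    """Turn one Wikipedia article into a pseudo (document, summary) pair whose
    length and extractiveness roughly match the target dataset's statistics.
    Lengths are measured in sentences here purely for simplicity."""
    summary_sents = article_sents[:summary_len]
    document_sents = article_sents[summary_len:summary_len + doc_len]
    if not summary_sents or not document_sents:
        return None
    # Discard pairs whose extractiveness is far from the desired level of abstraction.
    if abs(extractive_overlap(summary_sents, document_sents) - target_overlap) > tol:
        return None
    return {"document": " ".join(document_sents), "summary": " ".join(summary_sents)}
```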

Data Augmentation via Round-Trip Translation

This work performs round-trip translation to generate paraphrases of both the source documents and summaries.

Given a dataset of $N$ data points, each source document and summary is translated sentence-wise into a non-English language and back, keeping the top $k$ beam-search hypotheses as paraphrases. Pairing the $k$ document paraphrases with the $k$ summary paraphrases yields $N \cdot k^{2}$ augmented data points.
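A minimal sketch of sentence-wise round-trip translation with Hugging Face MarianMT is shown below. The pivot language (German) and the Helsinki-NLP checkpoints are assumptions for illustration; the paper's actual translation system may differ.

```python
# Illustrative sketch: round-trip translation keeping the top-k beam hypotheses.
from transformers import MarianMTModel, MarianTokenizer

en_de_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
en_de = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
de_en_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
de_en = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-de-en")

def translate_topk(sentences, tok, model, k):
    """Translate each sentence and return its top-k beam-search hypotheses."""
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch, num_beams=max(k, 2), num_return_sequences=k)
    decoded = tok.batch_decode(out, skip_special_tokens=True)
    # generate() returns a flat list of len(sentences) * k outputs; regroup per sentence.
    return [decoded[i * k:(i + 1) * k] for i in range(len(sentences))]

def round_trip_paraphrases(sentences, k=2):
    """Return k paraphrases of a sentence-split text: EN -> DE (top-1), then DE -> EN (top-k)."""
    german = [hyps[0] for hyps in translate_topk(sentences, en_de_tok, en_de, 1)]
    back = translate_topk(german, de_en_tok, de_en, k)
    # The i-th paraphrase stitches together the i-th hypothesis of every sentence.
    return [" ".join(sent_hyps[i] for sent_hyps in back) for i in range(k)]
```

Running this on both the source document and the summary, then pairing the $k$ document paraphrases with the $k$ summary paraphrases, gives the $k^{2}$ augmented pairs per original example mentioned above.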

Consistency Regularization

To balance learning from the small number of supervised samples without overfitting to them, the model must learn to be robust to small changes in the input examples.

The work introduces a KL-divergence loss that penalizes the model if the output probability distribution given the original input document is far from the distribution given the round-trip-translated input document.

Let $y$ be a target summary of length $m$ and $\hat{x}$ a paraphrase of the input document $x$ generated via round-trip translation. The KL-divergence loss is $$ L_{\text{cons}}(x, \hat{x}, y) = \sum_{t=1}^{m} \mathrm{KL}\left( f(\cdot \mid y_{0:t-1}, x; \theta) \,\|\, f(\cdot \mid y_{0:t-1}, \hat{x}; \theta) \right) $$ where $f(\cdot \mid y_{0:t-1}, \cdot; \theta)$ is the model's distribution over the next summary token given the gold prefix $y_{0:t-1}$.
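A minimal PyTorch sketch of this consistency term is below, assuming `model` is a Hugging Face seq2seq model (e.g. `BartForConditionalGeneration`) and that `x`, `x_hat`, and `y` are already-tokenized id tensors; how the term is weighted against the cross-entropy loss is not specified here.

```python
import torch.nn.functional as F

# Sketch (assumed interface): KL consistency between the decoder distributions
# obtained with the original document x and its paraphrase x_hat, teacher-forced
# on the same gold summary y.
def consistency_loss(model, x, x_hat, y):
    logits_orig = model(input_ids=x, labels=y).logits      # (batch, m, vocab): f(. | y_{0:t-1}, x)
    logits_para = model(input_ids=x_hat, labels=y).logits  # (batch, m, vocab): f(. | y_{0:t-1}, x_hat)

    p_orig = F.softmax(logits_orig, dim=-1)
    log_p_para = F.log_softmax(logits_para, dim=-1)
    # F.kl_div(input, target) computes KL(target || exp(input)),
    # i.e. KL( f(.|x) || f(.|x_hat) ) summed over the m summary positions.
    return F.kl_div(log_p_para, p_orig, reduction="sum")
```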