Authors: Yang Liu, Sheng Shen, Mirella Lapata
Paper reference: https://aclanthology.org/2021.naacl-main.56.pdf
Contribution
Training with maximum likelihood on a single reference summary may not be optimal, since a source document can have multiple valid summaries. This paper alleviates this problem with self-knowledge distillation, where the teacher and student share the same neural network architecture.
Under self-knowledge distillation, the teacher's outputs can be viewed as enriching the single-reference setting (effectively providing multiple summaries), which prevents the student from becoming overconfident in its predictions (i.e., it adds uncertainty).
The paper shows that self-knowledge distillation (1) improves over the teacher model in both pretrained and non-pretrained settings; and (2) makes the student's summaries significantly more succinct, informative, and factually consistent than the teacher's, at the expense of fluency. Injecting noise brings further improvements.
Details
Add Noise
The paper designs different noise mechanisms for the teacher and student.
Noisy teacher: injects noise into the distillation signal by keeping the teacher's dropout active while generating the teacher predictions used to train the student.
Noisy student: injects noise into the training data by perturbing the source document through (1) word dropping; (2) synonym replacement; and (3) sentence dropping (see the sketch after this list).
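A minimal sketch of the noisy-student perturbations, assuming tokenized sentences and a simple synonym lookup table; the perturbation probabilities and function names are illustrative assumptions, not the paper's exact setup.

```python
import random

def perturb_document(sentences, synonyms, word_drop=0.1,
                     synonym_prob=0.1, sent_drop=0.1):
    """Perturb a source document for the 'noisy student'.

    sentences: list of tokenized sentences (each a list of str).
    synonyms: dict mapping a word to a list of candidate synonyms.
    The rates are illustrative, not the paper's settings.
    """
    perturbed = []
    for sent in sentences:
        # (3) drop whole sentences with a small probability
        if random.random() < sent_drop:
            continue
        new_sent = []
        for word in sent:
            # (1) drop individual words
            if random.random() < word_drop:
                continue
            # (2) replace a word with one of its synonyms
            if word in synonyms and random.random() < synonym_prob:
                word = random.choice(synonyms[word])
            new_sent.append(word)
        if new_sent:
            perturbed.append(new_sent)
    return perturbed
```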
The training objective $$ \mathcal{L}_{\mathrm{FINAL}}=(1-\lambda) \mathcal{L}_{\mathrm{NLL}}+\lambda \mathcal{L}_{\mathrm{KD}} $$ combines the standard negative log-likelihood with a knowledge distillation loss (which encourages the student to imitate the teacher's output distribution), where
$$ \mathcal{L}_{\mathrm{KD}}=\sum_{t=1}^{T} \mathrm{KL}\left(\tilde{p}_{T}^{\alpha}\left(y_{t} \mid y_{1}^{t-1}, x\right),\; p_{S}\left(y_{t} \mid y_{1}^{t-1}, \tilde{x}\right)\right), $$
where $\tilde{p}_{T}^{\alpha}$ denotes the teacher's predictions with dropout rate $\alpha$ kept active, and $\tilde{x}$ is the perturbed source input.
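A minimal PyTorch sketch of this objective. It assumes `teacher` and `student` are seq2seq models with identical architectures that return per-step logits under teacher forcing; keeping the teacher in train mode leaves its dropout active, giving the noisy teacher distribution $\tilde{p}_{T}^{\alpha}$. All names and the interface are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(teacher, student, x, x_tilde, y, lam=0.5):
    """x: clean source ids, x_tilde: perturbed source ids,
    y: reference summary ids of shape (batch, T)."""
    with torch.no_grad():
        teacher.train()              # dropout stays active -> noisy teacher
        t_logits = teacher(x, y)     # (batch, T, vocab), teacher forcing
    s_logits = student(x_tilde, y)   # student conditions on perturbed input

    # standard NLL on the single reference summary
    nll = F.cross_entropy(
        s_logits.reshape(-1, s_logits.size(-1)), y.reshape(-1))

    # KL(teacher || student), summed over time steps, averaged over batch
    kd = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    )
    return (1 - lam) * nll + lam * kd
```

Note that `F.kl_div` expects the student's log-probabilities as the first argument and the teacher's probabilities as the target, which matches the KL direction in the equation above; $\lambda$ (`lam`) trades off the reference loss against the distillation loss.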