Authors: Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald
Paper reference: https://aclanthology.org/2020.acl-main.173.pdf
Contributions
This paper conducts human evaluations to better understand the types of hallucinations in abstractive summarization. The human evaluations show that pre-trained models generate more faithful and factual summaries. Experiments show that textual entailment measures correlate better with faithfulness and overall summary quality than standard metrics such as ROUGE and BERTScore, which mainly indicate the informativeness of summaries.
Details
Hallucinations in Summarization
- Factual hallucination: the summary contains information not found in the input document that is nonetheless factually correct. This may happen because of world knowledge stored during the model’s pretraining.
- Intrinsic hallucination: the summary uses terms or concepts from the document but misrepresents information in it (this can still be a factual hallucination). Intrinsic hallucinations reveal the models’ tendency to misrepresent document information due to a lack of document-level understanding and inference.
- Extrinsic hallucination: the summary adds information not introduced in the input document (this can also be a factual hallucination).
Hallucinations are not necessarily erroneous, but most of them are.
Pretraining improves faithfulness
Pretraining makes models more aware of the domain of the document, so they are more confident in predicting tokens from the document, which improves faithfulness.
Automatic measures for faithfulness
The paper further studies the extent of hallucination with two semantic-inference measures: (1) textual entailment and (2) question answering. The results show that textual entailment scores correlate best with both the faithfulness and factuality human scores.
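A minimal sketch of how such an entailment score for a (document, summary) pair could be computed, assuming an off-the-shelf MNLI-trained checkpoint from Hugging Face (roberta-large-mnli is an illustrative choice; the paper trains its own BERT-based entailment classifier on MultiNLI):

```python
# Sketch: score a (document, summary) pair with an off-the-shelf NLI model.
# Assumption: "roberta-large-mnli" is used only for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def entailment_score(document: str, summary: str) -> float:
    """Probability that the summary (hypothesis) is entailed by the document (premise)."""
    # Long documents are truncated to the encoder's maximum input length.
    inputs = tokenizer(document, summary, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # For this checkpoint the label order is contradiction / neutral / entailment;
    # check model.config.id2label when swapping in a different model.
    entail_idx = model.config.label2id.get("ENTAILMENT", 2)
    return probs[entail_idx].item()
```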
Experiments suggest that textual entailment could be used (1) as an automatic measure of faithfulness and (2) as a selection criterion: use the probability that a summary is entailed by the document to choose among candidate summaries.
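Reusing the scorer above, the selection criterion becomes a simple rerank over candidate summaries (e.g., beam-search hypotheses); the candidates below are hypothetical placeholders:

```python
# Sketch: pick the candidate summary most entailed by the document.
# Assumes entailment_score() from the previous sketch.
document = "..."  # source article text
candidates = [
    "Summary produced by beam hypothesis 1.",
    "Summary produced by beam hypothesis 2.",
]

best_summary = max(candidates, key=lambda s: entailment_score(document, s))
print(best_summary)
```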
Another way to measure faithfulness is to train a model explicitly to predict faithfulness. The paper shows that this slightly improves the ability to select faithful summaries.
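A sketch of this alternative: fine-tune a sequence-pair classifier on (document, summary) pairs labelled as faithful or hallucinated. The training pairs and the bert-base-uncased checkpoint below are hypothetical placeholders; the paper relies on its collected human annotations for this supervision.

```python
# Sketch: fine-tune a binary faithfulness classifier on labelled
# (document, summary) pairs. The data shown here is a placeholder.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
clf = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(clf.parameters(), lr=2e-5)

# label 1 = faithful, label 0 = hallucinated (placeholder examples)
train_pairs = [
    ("Document text ...", "A faithful summary.", 1),
    ("Document text ...", "A summary with hallucinated content.", 0),
]

clf.train()
for epoch in range(3):
    for doc, summ, label in train_pairs:
        batch = tokenizer(doc, summ, return_tensors="pt",
                          truncation=True, max_length=512)
        batch["labels"] = torch.tensor([label])
        loss = clf(**batch).loss  # cross-entropy over the pair classification
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# At selection time, the probability of the "faithful" class plays the same
# role as the entailment probability in the sketches above.
```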