Authors: Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, Richard Socher
Paper reference: https://arxiv.org/pdf/1908.08960.pdf

Contribution

This paper evaluates the research setup of text summarization (up until Aug. 2019) and highlights shortcomings in three areas: (1) datasets, (2) evaluation metrics, and (3) models. Through detailed experiments and human studies, the work suggests that future work should add constraints when constructing datasets, diversify model outputs and reduce overfitting to layout bias, and design evaluation methods that reflect human judgements and take other dimensions, such as factual consistency, into account.

Details

Dataset

Under-constrained task

The paper conducts studies demonstrating the difficulty and ambiguity of content selection in text summarization. Assessing the importance of information is difficult since it highly depends on the expectations and prior knowledge of readers.

The paper shows that under the current setting, in which models are simply given a document with one associated reference summary (and no additional information), the summarization task is under-constrained and too ambiguous to be solved by end-to-end models. Moreover, due to the abstractive nature of human-written summaries, similar content can be described in many different ways.

Bias in news data

News articles follow an inverted-pyramid structure in which the initial paragraphs contain the most important information; as a result, both reference summaries and models trained on them are biased toward the lead of the article.
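A quick way to see this layout (lead) bias is the Lead-3 baseline, which simply copies the first few sentences of an article and is known to be strong on news corpora. The sketch below is illustrative only: the sentence splitting is naive and the article text is invented.

```python
# Minimal sketch of a Lead-n baseline. Its strength on news data illustrates
# the layout bias described above. Sentence splitting here is naive; a real
# setup would use a proper sentence tokenizer (e.g. NLTK or spaCy).
import re

def lead_n(article: str, n: int = 3) -> str:
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:n])

article = (
    "A storm hit the coast on Monday. Thousands lost power. "
    "Officials say repairs may take days. Residents were advised to stay indoors."
)
print(lead_n(article))  # the first three sentences serve as the 'summary'
```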

Noise in scraped datasets

Current summarization datasets, scraped from the web, are filled with noisy examples (links, code snippets, placeholder texts, etc.).
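As an illustration (not the paper's actual preprocessing), a simple heuristic filter could flag examples dominated by links, markup, or placeholder text; the patterns and threshold below are assumptions chosen for the example.

```python
# Hedged sketch of heuristic noise filtering for scraped examples.
# The patterns and the 10% noise threshold are illustrative assumptions.
import re

NOISE_PATTERNS = [
    r"https?://\S+",          # URLs
    r"<[^>]+>",               # leftover HTML tags
    r"\{\{.*?\}\}",           # template placeholders
    r"click here|read more",  # boilerplate phrases
]

def is_noisy(text: str, max_noise_ratio: float = 0.1) -> bool:
    noise_chars = sum(
        len(m) for p in NOISE_PATTERNS for m in re.findall(p, text, flags=re.I)
    )
    return len(text) == 0 or noise_chars / len(text) > max_noise_ratio

examples = [
    "The committee approved the budget after a lengthy debate.",
    "Read more at http://example.com {{image_placeholder}} <div>",
]
print([ex for ex in examples if not is_noisy(ex)])  # keeps only the clean example
```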

Evaluation Metric

The current evaluation protocol relies primarily on exact lexical overlap between reference and candidate summaries, as measured by ROUGE, but the correlation between ROUGE scores and human judgements is weak.
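To make concrete what "exact lexical overlap" means, the sketch below computes a bare-bones ROUGE-1 F1 from unigram counts; official evaluations use the ROUGE toolkit (or wrappers around it), which adds stemming and further variants. The example sentences are invented.

```python
# Minimal sketch of ROUGE-1 F1 as pure unigram overlap. A faithful paraphrase
# with little word overlap scores poorly, illustrating why lexical-overlap
# metrics correlate weakly with human judgements.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# A paraphrase that preserves meaning but shares few tokens scores low...
print(rouge1_f1("the cabinet approved the plan", "ministers backed the proposal"))
# ...while near-verbatim copying scores high.
print(rouge1_f1("the cabinet approved the plan", "the cabinet approved the new plan"))
```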

None of the commonly used evaluation methods explicitly examines the factual consistency of summaries with respect to their source documents.

Models

The performance of current models relies too heavily on the layout bias of news corpora; models should instead overfit less to the biases of any particular domain.

The diversity of model outputs is low. Generated summaries from different models share a large part of their vocabulary at the token level but differ in how they organize those tokens into longer phrases. Comparing n-gram overlap between pairs of models with the overlap between models and reference summaries shows substantially higher overlap between any pair of models than between any model and the reference summaries.
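The sketch below illustrates this kind of comparison with a simple n-gram overlap measure between two hypothetical model outputs and a hypothetical reference; the summaries and the exact overlap definition are assumptions for illustration, not the paper's measurements.

```python
# Toy n-gram overlap comparison: model-vs-model overlap vs model-vs-reference
# overlap. The summaries below are invented stand-ins for real model outputs.
from collections import Counter

def ngrams(text: str, n: int) -> Counter:
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(a: str, b: str, n: int = 2) -> float:
    # Fraction of shared n-grams relative to the shorter summary.
    ga, gb = ngrams(a, n), ngrams(b, n)
    denom = min(sum(ga.values()), sum(gb.values())) or 1
    return sum((ga & gb).values()) / denom

model_a = "the storm knocked out power across the coast on monday"
model_b = "the storm knocked out power in coastal towns on monday"
reference = "thousands were left without electricity after monday's coastal storm"

print("model vs model:    ", overlap(model_a, model_b))   # high bigram overlap
print("model vs reference:", overlap(model_a, reference)) # much lower overlap
```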