Authors: Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
Paper reference: https://arxiv.org/pdf/2007.14062v2.pdf

Contribution

This paper proposes a sparse attention mechanism, BigBird, that combines local (sliding-window), global, and random attention to approximate full-attention models such as BERT and RoBERTa, reducing the quadratic dependency on sequence length to linear.

With this modified attention mechanism, a model running on similar hardware can handle sequences of up to 4096 tokens at a much lower computational cost. As a consequence of this capacity for longer context, BigBird achieves state-of-the-art results on various long-document NLP tasks, such as question answering and summarization, across a number of datasets.

Details

Previous methods in handling long text

There are two lines of work on handling long text:
(1) Use some mechanism to select a smaller subset of relevant context to feed into the model, optionally iterating. The paper notes that these methods often require significant engineering effort and are hard to train.
(2) Design approaches that do not require full attention, reducing memory and computation requirements, since full attention may not be necessary.

Drawback: with these methods, the same model architecture does not attain SoTA on multiple standard benchmarks.

BigBird

The paper explains the proposed sparse attention mechanism by viewing attention as a graph over tokens. The BigBird attention mechanism consists of three main components:

(a, For Long-range dependencies) All tokens attending to a set of $r$ random tokens.

The construction draws on random graph theory: a sparse random graph can approximate the complete graph spectrally, so information can still propagate between any pair of tokens within a small number of attention layers.
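As a concrete illustration, below is a minimal NumPy sketch of how such a random pattern could be represented as a boolean attention mask. The helper name `random_attention_mask` and the per-token sampling are assumptions for illustration; the paper's actual implementation samples random blocks of keys rather than individual tokens.

```python
import numpy as np

def random_attention_mask(seq_len: int, r: int, seed: int = 0) -> np.ndarray:
    """Boolean mask in which each query attends to r randomly chosen keys.

    Illustrative sketch only: BigBird samples random *blocks* of keys for
    hardware efficiency, but the connectivity idea is the same.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=r, replace=False)] = True
    return mask
```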

(b, For Local dependencies) All tokens attending to a set of $w$ local neighboring tokens.

In the terminology of graph theory, the clustering coefficient is a measure of the locality of connectivity, and small-world graphs exhibit a high clustering coefficient. With this in mind, to capture local structure in the context, the paper defines a sliding-window attention of width $w$: the query at location $i$ attends to keys at positions $i - w/2$ through $i + w/2$.
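The sliding-window pattern admits a similar mask-level sketch (the helper name `window_attention_mask` is an illustrative assumption; the window is simply clipped at the sequence boundaries):

```python
import numpy as np

def window_attention_mask(seq_len: int, w: int) -> np.ndarray:
    """Boolean mask for sliding-window attention of width w.

    Query i attends to keys at positions i - w//2 through i + w//2,
    clipped to the valid range [0, seq_len).
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo = max(0, i - w // 2)
        hi = min(seq_len, i + w // 2 + 1)
        mask[i, lo:hi] = True
    return mask
```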

(c, For Long-range dependencies) A set of $g$ global tokens attending to all parts of the sequence.

Experiments show that random blocks and a local window alone were insufficient to capture all of the context needed to match BERT's performance. Guided by its theoretical analysis of sparse attention mechanisms, the paper therefore adds “global tokens” that attend over the entire sequence and are attended to by every token.
There are two ways to define global tokens: (1) make some existing tokens “global” (the ITC variant); (2) add extra “global” tokens such as CLS (the ETC variant).
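A mask-level sketch of option (1), assuming the first $g$ tokens of the sequence are designated global (roughly the paper's ITC setup); the helper name is an illustrative assumption:

```python
import numpy as np

def global_attention_mask(seq_len: int, g: int) -> np.ndarray:
    """Boolean mask making the first g tokens global: they attend to every
    token, and every token attends to them."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:g, :] = True   # global tokens attend to the whole sequence
    mask[:, :g] = True   # all tokens attend to the global tokens
    return mask
```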

In this way, each query token attends to only a small subset of all possible tokens, while the combination of the three patterns yields a good approximation of full attention.
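To make the combined pattern concrete, the following dense NumPy sketch gates standard scaled dot-product attention with the union of the three masks sketched above. This is for exposition only: it still materializes an $n \times n$ score matrix, whereas BigBird's block-sparse implementation never does, which is what yields the linear memory cost; the helper names are the illustrative ones introduced earlier.

```python
import numpy as np

def sparse_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention restricted to positions allowed by `mask`.

    `mask` is the element-wise OR of the window, random, and global masks
    sketched above; every query is assumed to attend to at least one key
    (guaranteed here by the sliding window, which always contains the query).
    """
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)        # forbid masked-out keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Example usage with the illustrative helpers above:
# n, w, r, g = 4096, 64, 32, 2
# mask = (window_attention_mask(n, w)
#         | random_attention_mask(n, r)
#         | global_attention_mask(n, g))
# out = sparse_attention(Q, K, V, mask)   # Q, K, V: (n, d) arrays
```

In practice, BigBird groups tokens into blocks and applies these patterns at the block level, so the sparse attention can be computed with dense block matrix multiplications that are efficient on GPUs and TPUs.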