Authors: Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk
Paper reference: https://openreview.net/pdf?id=zeFrfgyZln

Contribution

This paper explores how to effectively select hard training negatives for dense text retrieval, which represents texts as dense vectors for approximate nearest neighbor (ANN) search. It theoretically shows the need for harder negatives and the limitations of the widely used local (in-batch) negatives in contrastive learning. It therefore proposes a new mechanism called Approximate nearest neighbor Negative Contrastive Learning (ANCE) to select hard training negatives globally from the entire corpus, using an asynchronously updated ANN index.
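
A minimal sketch of this asynchronous refresh loop is below, using a brute-force inner-product index as a stand-in for a real ANN library such as FAISS; the `encode`, `rebuild_index`, and `train_ance` names are illustrative placeholders, not the paper's code:

```python
import numpy as np

def encode(texts, checkpoint, dim=128):
    """Placeholder dual encoder: returns one dense vector per text."""
    rng = np.random.default_rng(checkpoint)
    return rng.normal(size=(len(texts), dim)).astype(np.float32)

def rebuild_index(corpus, checkpoint):
    """Re-encode the whole corpus with the latest checkpoint.
    A real system would load these vectors into an ANN index (e.g. FAISS)."""
    return encode(corpus, checkpoint)

def train_ance(corpus, num_steps=10_000, refresh_every=2_000):
    checkpoint = 0                            # stands in for the model parameters
    index = rebuild_index(corpus, checkpoint)
    for step in range(num_steps):
        # ... sample a training batch, retrieve hard negatives from `index`,
        #     compute the contrastive loss, and update the model ...
        if (step + 1) % refresh_every == 0:
            checkpoint = step + 1             # latest trained checkpoint
            # In ANCE this refresh runs asynchronously, in parallel with
            # training; here it is done inline for simplicity.
            index = rebuild_index(corpus, checkpoint)
    return index
```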

The proposed ANCE achieves faster convergence of model training and equally accurate text retrieval compared with a number of baselines.

Details

Dense Retrieval

Dense Retrieval (DR) aims to overcome the sparse retrieval bottleneck by matching in a continuous representation space learned via neural networks. One challenge in dense retrieval is to construct proper negative instances when learning the representation space.

ANCE focuses on representation learning for dense retrieval and uses the ANN index to construct global hard negatives for contrastive learning. Specifically, it samples negatives from the top documents retrieved by the DR model from the ANN index. This is different from REALM, which focuses on grounded language modeling and uses the ANN index to find grounding documents.
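
As a rough illustration of this sampling step (not the paper's implementation), hard negatives could be drawn from the top of the retrieval result while excluding the known positives:

```python
import numpy as np

def sample_hard_negatives(query_vec, doc_vecs, positive_ids,
                          top_k=200, num_neg=4, seed=0):
    """Retrieve the top_k documents by inner product, drop the known
    positives, and sample hard negatives from what remains."""
    rng = np.random.default_rng(seed)
    scores = doc_vecs @ query_vec                  # inner-product similarity
    top_ids = np.argsort(-scores)[:top_k]          # brute-force stand-in for ANN search
    candidates = [i for i in top_ids if i not in set(positive_ids)]
    return rng.choice(candidates, size=min(num_neg, len(candidates)), replace=False)

# Toy usage with random vectors.
docs = np.random.default_rng(1).normal(size=(1000, 128)).astype(np.float32)
query = docs[42] + 0.01                            # a query close to document 42
print(sample_hard_negatives(query, docs, positive_ids=[42]))
```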

Dense text retrieval has two phases:

  • Phase I: Learn a representation model that projects semantically similar texts to vectors with high similarity scores under a similarity function such as the inner product or cosine similarity.
  • Phase II: Adopt an approximate nearest neighbor (ANN) search algorithm to index these vectors and process queries (a toy sketch of both phases follows this list).
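
To make the two phases concrete, here is a toy split into an offline indexing step and an online query step; exact inner-product search stands in for a real ANN index, and the `encode` placeholder stands in for the trained model:

```python
import numpy as np

def encode(texts, dim=128, seed=0):
    """Placeholder for the learned dual encoder produced in Phase I."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(texts), dim)).astype(np.float32)

# Phase II, offline: encode the corpus once and index the vectors.
corpus = [f"document {i}" for i in range(10_000)]
doc_vecs = encode(corpus)              # a real ANN library (e.g. FAISS) would index these

# Phase II, online: encode the incoming query and search the index.
query_vec = encode(["an example query"], seed=1)[0]
scores = doc_vecs @ query_vec          # Phase I similarity: inner product
top10 = np.argsort(-scores)[:10]
print(top10)
```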

This paper focuses on Phase I, where it introduces a better negative sampling method to select informative dissimilar text pairs for training.
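
A common form of this training objective in dense retrieval is a softmax / negative-log-likelihood loss over the positive document and the sampled negatives; the sketch below assumes inner-product scores and is not taken verbatim from the paper:

```python
import numpy as np

def contrastive_loss(query_vec, pos_vec, neg_vecs):
    """Negative log likelihood of the positive document against the
    sampled negatives, with inner products as similarity scores."""
    scores = np.concatenate(([query_vec @ pos_vec], neg_vecs @ query_vec))
    log_probs = scores - scores.max() - np.log(np.exp(scores - scores.max()).sum())
    return -log_probs[0]               # the positive document sits at index 0

rng = np.random.default_rng(0)
query, positive = rng.normal(size=128), rng.normal(size=128)
negatives = rng.normal(size=(4, 128))  # e.g. hard negatives sampled via ANCE
print(contrastive_loss(query, positive, negatives))
```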