Embedding-based Zero-shot Retrieval through Query Generation
Passage retrieval addresses the problem of locating relevant passages, usually from a large corpus, given a query. In practice, lexical term-matching algorithms like BM25 are popular choices for retrieval owing to their efficiency. However, term-based matching algorithms often miss relevant passages...
Gespeichert in:
Hauptverfasser: | , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Passage retrieval addresses the problem of locating relevant passages,
usually from a large corpus, given a query. In practice, lexical term-matching
algorithms like BM25 are popular choices for retrieval owing to their
efficiency. However, term-based matching algorithms often miss relevant
passages that have no lexical overlap with the query and cannot be finetuned to
downstream datasets. In this work, we consider the embedding-based two-tower
architecture as our neural retrieval model. Since labeled data can be scarce
and because neural retrieval models require vast amounts of data to train, we
propose a novel method for generating synthetic training data for retrieval.
Our system produces remarkable results, significantly outperforming BM25 on 5
out of 6 datasets tested, by an average of 2.45 points for Recall@1. In some
cases, our model trained on synthetic data can even outperform the same model
trained on real data |
---|---|
DOI: | 10.48550/arxiv.2009.10270 |