Enhancing Dense Retrievers' Robustness with Group-level Reweighting
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | The anchor-document data derived from web graphs offers a wealth of paired
information for training dense retrieval models in an unsupervised manner.
However, unsupervised data contains diverse patterns across the web graph and
often exhibits significant imbalance, leading to suboptimal performance in
underrepresented or difficult groups. In this paper, we introduce WebDRO, an
efficient approach for clustering the web graph data and optimizing group
weights to enhance the robustness of dense retrieval models. Initially, we
build an embedding model for clustering anchor-document pairs. Specifically, we
contrastively train the embedding model for link prediction, which guides the
embedding model in capturing the document features behind the web graph links.
Subsequently, we employ group distributionally robust optimization to
recalibrate the weights across different clusters of anchor-document pairs
while training retrieval models. During training, we direct the model to
assign higher weights to clusters with higher loss and focus more on worst-case
scenarios. This approach ensures that the model has strong generalization
ability on all data patterns. Our experiments on MS MARCO and BEIR demonstrate
that our method can effectively improve retrieval performance in unsupervised
training and finetuning settings. Further analysis confirms the stability and
validity of group weights learned by WebDRO. The code of this paper can be
obtained from https://github.com/Hanpx20/GroupDRO_Dense_Retrieval. |
---|---|
DOI: | 10.48550/arxiv.2310.16605 |
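
The reweighting step described in the summary (higher weights for clusters with higher loss) can be illustrated with a minimal sketch. It assumes the commonly used exponentiated-gradient form of group distributionally robust optimization, which may differ from the exact update used in WebDRO. The function names, the step size, and the per-cluster loss values are all hypothetical and chosen only for illustration.

```python
import math


def update_group_weights(weights, group_losses, step_size=0.1):
    """One exponentiated-gradient step (assumed update rule, not taken from the paper).

    Each group's weight is multiplied by exp(step_size * loss), then all
    weights are renormalized to sum to 1, so groups with higher loss gain weight.
    """
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def robust_loss(weights, group_losses):
    """Weighted training objective: sum of group weight times group loss."""
    return sum(w * loss for w, loss in zip(weights, group_losses))


# Hypothetical average contrastive losses for three clusters of anchor-document pairs.
group_losses = [0.2, 1.5, 0.7]
weights = [1 / 3, 1 / 3, 1 / 3]  # start from uniform group weights

# In real training the losses would change every step; they are held fixed here
# only to show the direction in which the weights move.
for _ in range(5):
    weights = update_group_weights(weights, group_losses, step_size=0.5)

objective = robust_loss(weights, group_losses)
```

With the losses held fixed, the weight of the highest-loss cluster grows at every step, so the weighted objective moves from the plain average toward the worst-case cluster loss.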