Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
Saved in:

Main authors: , , , , , , , , ,
Format: Article
Language: English
Online access: Order full text
Abstract: Neural information retrieval (IR) systems have progressed rapidly in recent
years, in large part due to the release of publicly available benchmarking
tasks. Unfortunately, some dimensions of this progress are illusory: the
majority of the popular IR benchmarks today focus exclusively on downstream
task accuracy and thus conceal the costs incurred by systems that trade away
efficiency for quality. Latency, hardware cost, and other efficiency
considerations are paramount to the deployment of IR systems in user-facing
settings. We propose that IR benchmarks structure their evaluation methodology
to include not only metrics of accuracy, but also efficiency considerations
such as query latency and the corresponding cost budget for a reproducible
hardware setting. For the popular IR benchmarks MS MARCO and XOR-TyDi, we show
how the best choice of IR system varies according to how these efficiency
considerations are chosen and weighed. We hope that future benchmarks will
adopt these guidelines toward more holistic IR evaluation.
DOI: 10.48550/arxiv.2212.01340
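The abstract's central claim, that the best choice of IR system shifts as efficiency considerations are weighed differently, can be illustrated with a minimal sketch. All system names, accuracy scores, and latencies below are hypothetical placeholders, not results from the paper or from MS MARCO / XOR-TyDi:

```python
# Hypothetical (accuracy, latency) profiles for three illustrative IR systems.
# The numbers are made up to show the selection logic, not measured results.
systems = {
    "cross-encoder":    {"mrr": 0.40, "latency_ms": 900},
    "late-interaction": {"mrr": 0.38, "latency_ms": 90},
    "bi-encoder":       {"mrr": 0.34, "latency_ms": 25},
}

def best_under_budget(systems, latency_budget_ms):
    """Return the most accurate system whose latency fits the budget,
    or None if no system is feasible under that budget."""
    feasible = {name: s for name, s in systems.items()
                if s["latency_ms"] <= latency_budget_ms}
    if not feasible:
        return None
    return max(feasible, key=lambda name: feasible[name]["mrr"])

# Tightening the latency budget changes which system "wins":
for budget in (1000, 100, 30):
    print(budget, "->", best_under_budget(systems, budget))
```

Under a generous 1000 ms budget the most accurate (but slowest) system is chosen; at 100 ms and 30 ms progressively cheaper systems win, which is the tradeoff a benchmark reporting only downstream accuracy would hide.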