Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR
Format: Article
Language: English
Abstract: The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark, a combination of different IR evaluation datasets. Interestingly, previous studies found that, particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted into what makes argument retrieval so "special". To more deeply analyze the potential limits of neural retrieval models, we run a reproducibility study on the Touché 2020 data. Our study focuses on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. The black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touché 2020 data, and we also find that quite a few of the neural models' results are unjudged. As many of the short Touché passages are not argumentative and thus non-relevant per se, and as the missing judgments complicate fair comparison, we denoise the Touché 2020 data by excluding very short passages (fewer than 20 words) and by augmenting the unjudged data with post-hoc judgments following the Touché guidelines. On the denoised data, the effectiveness of the neural models improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code and the augmented Touché 2020 dataset are available at https://github.com/castorini/touche-error-analysis.
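The length-based denoising step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the whitespace tokenization, the BEIR-style corpus layout, and the helper names `keep_passage` and `denoise_corpus` are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code): drop Touché 2020 passages
# shorter than 20 words, mirroring the denoising step in the abstract.
# Assumes a BEIR-style corpus dict: {doc_id: {"title": ..., "text": ...}}
# and simple whitespace tokenization.

MIN_WORDS = 20  # threshold stated in the abstract


def keep_passage(text: str, min_words: int = MIN_WORDS) -> bool:
    """Keep a passage only if it has at least `min_words` whitespace tokens."""
    return len(text.split()) >= min_words


def denoise_corpus(corpus: dict) -> dict:
    """Return a copy of a BEIR-style corpus without very short passages."""
    return {
        doc_id: doc
        for doc_id, doc in corpus.items()
        if keep_passage(doc.get("text", ""))
    }


if __name__ == "__main__":
    corpus = {
        "d1": {"title": "", "text": "Too short to be argumentative."},
        "d2": {"title": "", "text": " ".join(["word"] * 25)},
    }
    print(sorted(denoise_corpus(corpus)))  # ['d2']
```

Under these assumptions, the filter removes passages like "d1" above, which are too short to carry an argument, before retrieval effectiveness is re-measured on the cleaned corpus.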
DOI: 10.48550/arxiv.2407.07790