Towards Question-based High-recall Information Retrieval: Locating the Last Few Relevant Documents for Technology-assisted Reviews
Published in: | ACM Transactions on Information Systems 2020-07, Vol.38 (3), p.1-35 |
---|---|
Main authors: | , |
Format: | Article |
Language: | eng |
Online access: | Full text |
Summary: | While continuous active learning algorithms have proven effective in finding most of the relevant documents in a collection, the cost of locating the last few remains high for applications such as Technology-assisted Reviews (TAR). To locate these last few but significant documents efficiently, Zou et al. [2018] proposed a novel interactive algorithm. The algorithm is based on constructing questions about the presence or absence of entities in the missing relevant documents. The hypothesis is that entities play a central role in documents carrying key information and that users are able to answer questions about the presence or absence of an entity in the missing relevant documents. Based on this, a Sequential Bayesian Search-based approach that selects the optimal sequence of questions to ask was devised. In this work, we extend Zou et al. [2018] by (a) investigating the noise tolerance of the proposed algorithm; (b) proposing an alternative objective function to optimize, which accounts for users’ “erroneous” answers; (c) proposing a method that sequentially decides the best point to stop asking the user questions; and (d) conducting a small user study to validate some of the assumptions made by Zou et al. [2018]. Furthermore, all experiments are extended to demonstrate the effectiveness of the proposed algorithms not only in the phase of abstract appraisal (i.e., finding the abstracts of potentially relevant documents in a collection) but also in finding the documents to be included in the review (i.e., finding the subset of those relevant abstracts for which the article remains relevant). The experimental results demonstrate that the proposed algorithms can greatly improve performance, requiring the review of fewer irrelevant documents to find the last relevant ones compared to state-of-the-art methods, even in the case of noisy answers. Further, they show that our algorithm learns to stop asking questions at the right time. 
Last, we conduct a small user study involving an expert reviewer. The user study validates some of the assumptions made in this work regarding the user’s willingness to answer the system’s questions and the extent of that willingness, as well as the user’s ability to answer these questions. |
---|---|
ISSN: | 1046-8188 1558-2868 |
DOI: | 10.1145/3388640 |
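
The question-and-update loop outlined in the summary can be illustrated with a small sketch. This is not the authors' algorithm: it is a generic Bayesian update over candidate documents, written only to make the idea of entity questions with noisy answers concrete. The function names (`pick_entity`, `update_belief`), the fixed `error_rate`, the rule of asking about the entity that splits the belief mass most evenly, and the toy documents are all assumptions introduced here for illustration.

```python
def pick_entity(belief, doc_entities):
    """Hypothetical selection rule: ask about the entity whose presence
    splits the current belief mass closest to 50/50."""
    entities = set().union(*doc_entities.values())

    def mass_with(entity):
        # Total belief on documents that contain this entity.
        return sum(p for d, p in belief.items() if entity in doc_entities[d])

    # sorted() makes tie-breaking deterministic (alphabetical order).
    return min(sorted(entities), key=lambda e: abs(mass_with(e) - 0.5))


def update_belief(belief, doc_entities, entity, answer_yes, error_rate=0.1):
    """Bayes update of the belief over documents after one yes/no answer.

    Assumption: the user answers incorrectly with a fixed probability
    error_rate, independent of the question.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must lie strictly between 0 and 0.5")
    posterior = {}
    for doc, prior in belief.items():
        consistent = (entity in doc_entities[doc]) == answer_yes
        likelihood = 1.0 - error_rate if consistent else error_rate
        posterior[doc] = prior * likelihood
    total = sum(posterior.values())
    return {doc: p / total for doc, p in posterior.items()}


# Toy data (hypothetical): four candidate documents and their entities.
docs = {
    "d1": {"aspirin", "stroke"},
    "d2": {"aspirin", "diabetes"},
    "d3": {"insulin", "diabetes"},
    "d4": {"insulin", "stroke"},
}
belief = {doc: 1.0 / len(docs) for doc in docs}

entity = pick_entity(belief, docs)
belief = update_belief(belief, docs, entity, answer_yes=True)
print(entity, belief)
```

Because the likelihood of an inconsistent answer is `error_rate` rather than zero, a single wrong answer lowers the belief in the true document but does not eliminate it, which is the intuition behind tolerating noisy answers.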