Document‐to‐Document Retrieval Using Self‐Retrieval Learning and Automatic Keyword Extraction

Bibliographic details
Published in:	IEEJ transactions on electrical and electronic engineering 2025-01, Vol.20 (1), p.69-76
Main authors:	Seki, Yasuaki; Hamagami, Tomoki
Format:	Article
Language:	English
Subjects:	Datasets; document retrieval; Documents; DRMM; Information retrieval; Machine learning; natural language processing; Queries; self‐supervised learning; Supervised learning
Online access:	Full text

Description
Abstract:	In this study, we propose self‐retrieval learning, a self‐supervised learning method that does not require an annotated dataset. In self‐retrieval learning, keywords extracted from documents are used as queries to construct training data that imitate the relationship between query and corpus, such that the documents themselves are retrieved. In the usual supervised learning for information retrieval, a pair of query and corpus document is required as training data, but self‐retrieval learning does not require such data. In addition, it does not use information such as reference lists or other documents connected to the query, but only the text of the documents in the target domain. In our experiments, self‐retrieval learning was performed on the EU and UK legal document retrieval task using a retrieval model called DRMM. We found that self‐retrieval learning not only does not require supervised datasets, but also outperforms supervised learning with the same model in terms of retrieval accuracy. © 2024 Institute of Electrical Engineers of Japan and Wiley Periodicals LLC.
ISSN:	1931-4973, 1931-4981
DOI:	10.1002/tee.24181
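
Illustrative sketch (not part of the catalog record or the paper): the abstract describes building training data by using keywords extracted from each document as a query for which that same document is the target. The Python below is one possible reading of that idea, under loudly stated assumptions. The abstract does not say which keyword extractor is used, so a plain TF-IDF ranking stands in for it here. The randomly sampled negative document and the names `tokenize`, `extract_keywords`, and `build_self_retrieval_triples` are likewise hypothetical choices made for illustration only.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(docs, top_k=3):
    """Return the top_k highest TF-IDF terms of each document.

    ASSUMPTION: TF-IDF is only a stand-in; the source does not
    specify the keyword extraction method.
    """
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    keywords = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


def build_self_retrieval_triples(docs, top_k=3, seed=0):
    """Build (query, positive_doc_id, negative_doc_id) triples.

    The query is made of keywords from document i, the positive is
    document i itself, and the negative is a random other document.
    ASSUMPTION: random negative sampling is a hypothetical choice.
    Requires at least two documents.
    """
    rng = random.Random(seed)
    triples = []
    for i, kws in enumerate(extract_keywords(docs, top_k)):
        negatives = [j for j in range(len(docs)) if j != i]
        triples.append((" ".join(kws), i, rng.choice(negatives)))
    return triples
```

Triples of this shape need only the raw text of the target-domain documents, with no annotated query and document pairs, which is the property the abstract emphasizes. How such data is actually fed to DRMM is not described in this record.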