Information Retrieval in long documents: Word clustering approach for improving Semantics
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | In this paper, we propose an alternative to deep neural networks for semantic
information retrieval for the case of long documents. This new approach
exploits clustering techniques to take into account the meaning of words in
Information Retrieval systems targeting long as well as short documents. This
approach uses a specially designed clustering algorithm to group words with
similar meanings into clusters. The dual representation (lexical and semantic)
of documents and queries is based on the vector space model proposed by Gerard
Salton, applied in the vector space formed by these clusters. The
originality of our proposal lies at several levels: first, we propose an
efficient algorithm for constructing clusters of semantically close
words using word embeddings as input; then we define a formula for weighting
these clusters; and finally we propose a function that efficiently combines
the meanings of words with a lexical model widely used in Information
Retrieval. The evaluation of our proposal in three contexts with two different
datasets, SQuAD and TREC-CAR, has shown that it significantly improves on
classical approaches based only on keywords, without degrading the lexical
aspect. |
DOI: | 10.48550/arxiv.2302.10150 |
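
The summary describes three steps: clustering semantically close words from word embeddings, representing documents and queries in the space of those clusters, and combining that semantic view with a lexical one. The sketch below illustrates the general idea only. It is not the authors' algorithm: the greedy threshold clustering, the raw-count cluster weights, the cosine-based lexical score, the linear mix controlled by `alpha`, and the toy 2-d embeddings in `EMB` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy 2-d "embeddings" (hypothetical values, for illustration only).
EMB = {
    "car": (1.0, 0.1), "automobile": (0.9, 0.2),
    "fruit": (0.1, 1.0), "apple": (0.2, 0.9),
}

def cosine(u, v):
    """Cosine similarity between two dense 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def cluster_words(emb, threshold=0.9):
    """Greedy clustering (assumed stand-in): a word joins the first cluster
    whose seed word is at least `threshold` similar, else starts a new one."""
    seeds, word2cluster = [], {}
    for word, vec in emb.items():
        for cid, seed in enumerate(seeds):
            if cosine(vec, emb[seed]) >= threshold:
                word2cluster[word] = cid
                break
        else:
            seeds.append(word)
            word2cluster[word] = len(seeds) - 1
    return word2cluster

def bag(tokens, mapping=None):
    """Count vector over words, or over cluster ids if a mapping is given."""
    if mapping is None:
        return Counter(tokens)
    return Counter(mapping[t] for t in tokens if t in mapping)

def cos_sparse(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def combined_score(query, doc, word2cluster, alpha=0.5):
    """Linear mix (an assumption) of a lexical and a cluster-space score."""
    lexical = cos_sparse(bag(query), bag(doc))
    semantic = cos_sparse(bag(query, word2cluster), bag(doc, word2cluster))
    return alpha * lexical + (1 - alpha) * semantic

word2cluster = cluster_words(EMB)
score_synonym = combined_score(["car"], ["automobile"], word2cluster)
score_unrelated = combined_score(["car"], ["apple"], word2cluster)
```

In this toy setup, a query and a document that share no keyword can still receive a non-zero score when their words fall in the same cluster, while an exact keyword match keeps its lexical contribution. A production system would replace the lexical cosine with an established ranking function and the raw counts with a proper cluster weighting formula.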