The effect of clustering algorithms on question answering


Bibliographic details
Published in:	Expert Systems with Applications, 2024-06, Vol. 243, p. 122959, Article 122959
Main authors:	AlMahmoud, Rana Husni; Alian, Marwah
Format:	Article
Language:	English
Subjects:	Arabic language; Clustering; Question answering; Similarity measures
Online access:	Full text

Description
Abstract:	Question answering (QA) is one of the essential fields in information retrieval, where specific answers are provided instead of large documents. The relations among questions and answers are determined using natural language processing techniques, while clustering algorithms can improve the effectiveness of result retrieval by reducing the number of comparisons required for a specific question or answer. In this work, we introduce a clustering-based approach for a QA system. This approach groups related questions into clusters using different clustering algorithms, identifies the appropriate answer using similarity methods between the answers and the generated clusters, and then assigns answers to their most related questions. Different clustering algorithms are tested: k-means, spherical k-means, single-linkage hierarchical clustering (SLHA), unweighted pair group method with arithmetic mean (UPGMA), expectation–maximization (EM), and clustering Arabic documents based on bond energy (CADBE). The effectiveness of a clustering algorithm is investigated with respect to several factors: the number of clusters, the text representation, the similarity measure between answers and clusters, and the similarity measure between answers and questions in a selected cluster. In addition, a comprehensive ranking system is introduced to evaluate the performance of clustering algorithms. Evaluation is performed using the Dataset of Arabic Why Question Answering System (DAWQAS) and the Multilingual Question Answering (MLQA) dataset. Results show that CADBE achieves the highest accuracy and the first rank, followed by SLHA and UPGMA, while spherical k-means has the lowest rank. The performance of clustering algorithms on the MLQA dataset is affected by its characteristics, such as short questions, long and varied answers, and diverse subject domains. Unigram and bigram intersection measures perform well in most cases. Term frequency-inverse document frequency (TF-IDF) representation outperforms word embeddings on DAWQAS. Overall, the experiments provide insights into the performance of clustering algorithms in QA systems.
Highlights:
• A clustering-based QA system groups related questions and selects answers via similarity.
• Answers are assigned to related questions using various similarity methods.
• Certain factors are explored to investigate the effectiveness of clustering algorithms.
• A comprehensive ranking system evaluates the performance of clustering algorithms.
• CADBE achieves highest accuracy,
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2023.122959
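
The two-stage matching described in the abstract (cluster the questions, pick the cluster closest to an answer, then pick the closest question inside that cluster) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not define the unigram and bigram intersection measures, so the version below (shared n-grams divided by the size of the smaller n-gram set) is an assumption. The function names `ngrams`, `intersection_sim`, `single_linkage`, and `assign_answer` are hypothetical, the clustering is a naive single-linkage procedure standing in for SLHA, and the sample questions are invented for illustration.

```python
import re
from itertools import combinations


def ngrams(text, n):
    """Return the set of word n-grams of a lowercased text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def intersection_sim(a, b, n=1):
    """Assumed measure: shared n-grams divided by the smaller n-gram set."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / min(len(ga), len(gb))


def single_linkage(texts, k, sim=intersection_sim):
    """Naive single-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(texts))]
    while len(clusters) > k:
        # Merge the two clusters whose closest members are most similar.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: max(sim(texts[i], texts[j])
                              for i in clusters[p[0]]
                              for j in clusters[p[1]]),
        )
        clusters[x] += clusters.pop(y)  # x < y, so index x stays valid
    return clusters


def assign_answer(answer, questions, clusters, sim=intersection_sim):
    """Stage 1: choose the closest cluster. Stage 2: choose a question in it."""
    best = max(clusters,
               key=lambda c: sim(answer, " ".join(questions[i] for i in c)))
    return max(best, key=lambda i: sim(answer, questions[i]))


# Invented sample data, for illustration only.
questions = [
    "Why is the sky blue?",
    "Why does the sky turn red at sunset?",
    "Why do birds migrate in winter?",
    "Why do some birds fly south?",
]
answer = "Birds migrate in winter to find food and warmer weather."

clusters = single_linkage(questions, k=2)
print(clusters)                                     # [[0, 1], [2, 3]]
print(assign_answer(answer, questions, clusters))   # 2
```

With the questions clustered in advance, an answer is compared against 2 cluster texts and then 2 questions rather than against every question; on four questions the saving is nil, but it grows with the size of the question set, which is the reduction in comparisons the abstract refers to.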