A language-independent authorship attribution approach for author identification of text documents

Detailed description

Bibliographic details
Published in:	Expert Systems with Applications, 2021-10, Vol. 180, p. 115139, Article 115139
First author:	Ramezani, Reza
Format:	Article
Language:	English
Subjects:	Accuracy; Author identification; Authorship; Authorship attribution; Classifiers; Datasets; Evaluation; Inverse document frequency; NLP; Short message service; Statistical methods; Term frequency; Text similarity
Online access:	Full text

Description
Summary:
• A new lazy classifier for the authorship attribution task.
• A new similarity metric for calculating the similarity between documents.
• A language-independent classifier that needs no NLP techniques.
• An examination of the effects of different classifiers and stylometric features on authorship attribution accuracy.

In the Authorship Attribution (AA) task, the most likely author of a textual document, such as a book, paper, news article, text message, or post, is identified using statistical and computational methods. This paper presents a new computational approach for identifying the most likely author of text documents. The proposed solution emphasizes lazy profile-based classification and, using the Term Frequency-Inverse Document Frequency (TF-IDF) scheme, introduces a new measure for identifying the important terms of documents. The importance of the terms is then used to calculate the similarity between an anonymous document and known documents. The proposed solution works with raw text documents and does not require any NLP tools for preprocessing, which makes it language-independent. Its efficiency has been evaluated through several experiments on two datasets, one English and one Persian, each of which contains six corpora with different numbers of authors. The results show that the proposed solution outperforms state-of-the-art stylometric features, employed by seven well-known classifiers, obtaining an accuracy of 0.902 on the English dataset and 0.931 on the Persian dataset. In addition, supplementary experiments evaluate the effect of document length on the accuracy of the proposed solution, examine the computation time of the proposed solution and the competing classifiers, and identify the most effective stylometric features and classifiers.
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2021.115139
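
The summary describes lazy, profile-based attribution driven by TF-IDF term importance on raw text. The paper's own importance measure and similarity metric are not given in this record, so the following is only a minimal sketch of the general idea, assuming standard smoothed TF-IDF weights and cosine similarity as stand-ins. All function names and the toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace splitting only: no stemming, tagging, or language-specific tools.
    return text.lower().split()


def build_profiles(known_docs):
    """Merge each author's known documents into one term-count profile."""
    profiles = {}
    for author, text in known_docs:
        profiles.setdefault(author, Counter()).update(tokenize(text))
    return profiles


def idf_weights(profiles):
    """Smoothed inverse document frequency, treating each profile as one document."""
    n = len(profiles)
    df = Counter(term for counts in profiles.values() for term in counts)
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def tfidf(counts, idf):
    total = sum(counts.values()) or 1
    # Terms that appear in no profile have no IDF weight and are skipped.
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(anonymous_text, known_docs):
    """Return (best_author, scores). Nothing is trained ahead of time (lazy)."""
    profiles = build_profiles(known_docs)
    idf = idf_weights(profiles)
    query = tfidf(Counter(tokenize(anonymous_text)), idf)
    scores = {a: cosine(query, tfidf(c, idf)) for a, c in profiles.items()}
    return max(scores, key=scores.get), scores


# Toy data, invented purely for illustration.
known = [
    ("alice", "the sea was calm and the gulls circled the harbour"),
    ("alice", "calm water and a grey harbour at dawn"),
    ("bob", "the compiler rejected the patch and the build failed"),
    ("bob", "we fixed the build and merged the patch"),
]
best, scores = attribute("gulls over the calm harbour", known)
print(best, scores)
```

Because the sketch only splits on whitespace and counts tokens, it makes no assumption about the language of the text, which mirrors the language-independence claim in the summary. Real corpora would need far more text per author than this toy example.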