Active Learning for News Article's Authorship Identification

Bibliographic Details
Published in:	IEEE Access 2023, Vol. 11, p. 98415-98426
Main Authors:	Abbas, Sidra; Alsubai, Shtwai; Sampedro, Gabriel Avelino; Abisado, Mideth; Almadhor, Ahmad S.; Kryvinska, Natalia; Zaidi, Monji Mohamed
Format:	Article
Language:	English
Subjects:	Active learning; Analytical models; Authorship; Authorship identification; Datasets; Deep learning; Feature extraction; Forensics; Learning; Machine learning; Machine learning algorithms; Multilayer perceptrons; News; News articles; Publishing; Support vector machines; Text analysis; Writing
Online Access:	Full text

Description
Summary:	Over time, the amount of textual data has increased drastically, especially due to the publication of articles. As a consequence, there has been a rise in anonymous content. Research is being conducted to determine alternative methods for identifying the authors of unknown texts. To this end, a system has to be developed that accurately determines the author of an unknown text, given a group of writing samples. Active learning is utilized in this study because it iteratively selects the most informative samples to include in the training set, which enables more precise and accurate authorship identification with fewer examples. This makes it useful for analyzing the rising amount of anonymous content and identifying unknown text authors. This study proposes a novel approach that utilizes active learning (AL) based machine learning models, namely Logistic Regression (AL-LR), Random Forest (AL-RF), XGBoost (AL-XGB), and Multilayer Perceptron (AL-MLP), for authorship identification. The proposed approach extracts valuable characteristics of the writer using Term Frequency-Inverse Document Frequency (TF-IDF). The comprehensive dataset selected for this study, "All the news," is divided into three subsets: Article 1, Article 2, and Article 3. The dataset's scope is restricted to the top 50 authors for the experiments. The experimental outcomes reveal that the proposed AL-XGB model achieves superior performance on Article 1 of the "All the news" dataset. Further, the AL-LR model performed well on Article 2, and the AL-MLP model performed well on Article 3. The results support using the proposed approach for authorship identification.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2023.3310813
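
The summary describes active learning over TF-IDF features: a classifier is retrained repeatedly while the most informative unlabeled samples are moved into the training set. The sketch below illustrates that general idea with scikit-learn's `TfidfVectorizer` and `LogisticRegression`. It is not the authors' implementation. The query strategy (least-confidence sampling), the function name `active_learning_loop`, its parameters, and the six-sentence toy corpus with authors "a" and "b" are all assumptions made for illustration; the record does not state which query strategy or hyperparameters the study used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def active_learning_loop(X_pool, y_pool, seed_idx, n_rounds, batch_size=1):
    """Pool-based active learning with least-confidence sampling.

    Hypothetical helper: the query strategy is an assumption, not taken
    from the article. y_pool stands in for a human annotator (oracle).
    """
    labeled = list(seed_idx)
    seen = set(labeled)
    unlabeled = [i for i in range(X_pool.shape[0]) if i not in seen]
    model = LogisticRegression(max_iter=1000)

    for _ in range(n_rounds):
        if not unlabeled:
            break
        # Train on everything labeled so far.
        model.fit(X_pool[labeled], y_pool[labeled])
        # Least confidence: 1 minus the highest predicted class probability.
        proba = model.predict_proba(X_pool[unlabeled])
        uncertainty = 1.0 - proba.max(axis=1)
        # Positions (within `unlabeled`) of the most uncertain samples.
        picked = np.argsort(-uncertainty)[:batch_size]
        # Pop from the highest position down so earlier pops do not shift later ones.
        for pos in sorted((int(p) for p in picked), reverse=True):
            labeled.append(unlabeled.pop(pos))

    # Final fit on the full labeled set.
    model.fit(X_pool[labeled], y_pool[labeled])
    return model, labeled


# Toy corpus (invented for illustration): two "authors" with distinct vocabularies.
texts = [
    "the senate passed the budget bill on tuesday",
    "lawmakers debated the tax bill in the senate",
    "the senate committee voted on the spending bill",
    "the striker scored twice in the cup final",
    "the goalkeeper saved a penalty in the final",
    "the coach praised the striker after the match",
]
authors = np.array(["a", "a", "a", "b", "b", "b"])

# TF-IDF turns each text into a sparse weighted term vector.
X = TfidfVectorizer().fit_transform(texts)

# Start from one labeled text per author, then query two more samples.
model, labeled = active_learning_loop(X, authors, seed_idx=[0, 3], n_rounds=2)
print("labeled indices:", labeled)
print("predictions:", model.predict(X))
```

The seed set must contain at least two authors, since the classifier cannot be fitted on a single class. Swapping `LogisticRegression` for another estimator that exposes `predict_proba` (for example a random forest or a multilayer perceptron) leaves the loop unchanged.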