Improved TFIDF in big news retrieval: An empirical study
Published in: Pattern Recognition Letters, 2017-07, Vol. 93, p. 113-122
Format: Article
Language: English
Online access: Full text
Abstract:

Highlights:
- The terms are assessed by considering distances between documents.
- We create a two-stage algorithm for refining the terms' weights.
- Distance learning is the main feature of our proposed methods.
- We analyze Reuters news with respect to text classification and clustering.
Thomson Reuters news articles have long been integral data sources that have given rise to several inspiring applications of text classification and clustering. The most well-known term weighting approach, the term frequency-inverse document frequency (TFIDF) method, is often used to assign the term weights that support such applications. Thomson Reuters reports pertinent incoming news (e.g., the refugee crisis in Europe) over a given period of time, so the most prominent terms (e.g., "refugee") frequently appear across a large collection of news stories. When term weights are measured via the TFIDF method, such weights are heavily compromised once the collection of news grows sufficiently large. Because the TFIDF approach treats these most important terms as noise, assigning them lower weights, news retrieval without the most important terms becomes difficult and ineffective. We therefore present a new distance-based term weighting method that overcomes this bias by considering a basic characteristic of big news corpora containing large amounts of news: each news article is similar to or different from the others. Not all articles should be considered to contribute equally to the weighting of a particular term. In this study, the weight of a particular term is assessed based on distances between the articles in which it occurs, and this weight is highly sensitive to whether similar articles cause a term to occur and whether different articles cause it to disappear. The most important terms are thus recovered in large news corpora by studying similarities between news stories. In addition, we create a two-stage learning algorithm to refine the terms' weights, and we develop an intelligent model that applies our term weighting method to Reuters news analyses based on classification and clustering problems.
The experimental results show that our methods outperform TFIDF in terms of news classification and clustering.
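The bias the abstract critiques can be seen in a minimal sketch of plain TFIDF (the toy corpus and function below are illustrative, not the paper's implementation): a prominent term such as "refugee" that occurs in every article of a large news collection gets an inverse document frequency of log(N/N) = 0, so its weight vanishes no matter how important it is.

```python
import math

def tfidf(term, doc, corpus):
    # Term frequency: raw count of the term in the document.
    tf = doc.count(term)
    # Inverse document frequency: log of corpus size over the
    # number of documents that contain the term.
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

# Toy corpus of tokenized "articles": "refugee" appears in every one,
# so its IDF is log(4/4) = 0 and its TFIDF weight collapses to zero,
# while a rarer term like "crisis" keeps a positive weight.
corpus = [
    ["refugee", "crisis", "europe"],
    ["refugee", "border", "camp"],
    ["refugee", "aid", "europe"],
    ["refugee", "policy", "vote"],
]
print(tfidf("refugee", corpus[0], corpus))  # 0.0
print(tfidf("crisis", corpus[0], corpus))   # log(4) ≈ 1.386
```

This is exactly the failure mode the proposed distance-based weighting targets: the term most characteristic of the collection receives the lowest weight under TFIDF.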
ISSN: 0167-8655, 1872-7344
DOI: 10.1016/j.patrec.2016.11.004