Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts
Published in: | Information processing & management 2016-03, Vol.52 (2), p.217-257 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | • Scalable algorithm based on bipartite networks to perform transduction.
• Unlabeled data effectively employed to improve classification performance.
• Better performance than algorithms based on vector space model or networks.
• Rigorous evaluation to show the drawbacks of the existing transductive algorithms.
• Trade-off analysis between inductive supervised and transductive classification.
Transductive classification is a useful way to classify texts when labeled training examples are insufficient. Several algorithms have been proposed to perform transductive classification on text collections represented in a vector space model. However, the use of these algorithms is unfeasible in practical applications because of the independence assumption among instances or terms and other drawbacks of these algorithms. Network-based algorithms have emerged to avoid the drawbacks of the algorithms based on the vector space model and to improve transductive classification. Networks are mostly used for label propagation, in which some labeled objects propagate their labels to other objects through the network connections. Bipartite networks are useful for representing text collections as networks and performing label propagation. Generating this type of network avoids requirements such as collections with hyperlinks or citations, the computation of similarities among all texts in the collection, and the setup of a number of parameters. In a bipartite heterogeneous network, objects correspond to documents and terms, and the connections are given by the occurrences of terms in documents. Label propagation is performed from documents to terms and then from terms to documents iteratively. Nevertheless, instead of using terms merely as a means of label propagation, in this article we propose using the bipartite network structure to define the relevance scores of terms for classes through an optimization process and then propagating these relevance scores to define labels for unlabeled documents. The new document labels are used to redefine the relevance scores of terms, which in turn redefine the labels of unlabeled documents in an iterative process. We demonstrated that the proposed approach surpasses the algorithms for transductive classification based on the vector space model or on networks.
Moreover, we demonstrated that the proposed algorithm makes effective use of unlabeled documents to improve classification and is faster than other transductive algorithms. |
---|---|
ISSN: | 0306-4573 1873-5371 |
DOI: | 10.1016/j.ipm.2015.07.004 |
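
The abstract describes label propagation on a document-term bipartite network, where labels flow from documents to terms and then from terms back to documents. The following is a minimal sketch of that plain propagation scheme only, under stated assumptions: it is not the optimization-based algorithm proposed in the article, and the function name `propagate_labels`, its parameters, the averaging rule, and the toy documents are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def propagate_labels(docs, seed_labels, classes, iterations=20):
    """Plain label propagation on a document-term bipartite network.

    docs        : list of token lists (one per document)
    seed_labels : dict {document index: class} for the labeled documents
    classes     : list of all class names
    Returns a list with one predicted class per document.
    """
    # Edge weights: number of occurrences of each term in each document.
    doc_terms = [Counter(tokens) for tokens in docs]

    # Class scores per document: one-hot for labeled documents,
    # all zeros for unlabeled ones (assumption: no prior information).
    doc_scores = []
    for d in range(len(docs)):
        if d in seed_labels:
            doc_scores.append(
                {c: 1.0 if c == seed_labels[d] else 0.0 for c in classes}
            )
        else:
            doc_scores.append({c: 0.0 for c in classes})

    for _ in range(iterations):
        # Step 1: documents -> terms (weighted average of document scores).
        term_scores = defaultdict(lambda: {c: 0.0 for c in classes})
        term_weight = Counter()
        for d, terms in enumerate(doc_terms):
            for t, w in terms.items():
                term_weight[t] += w
                for c in classes:
                    term_scores[t][c] += w * doc_scores[d][c]
        for t in term_scores:
            for c in classes:
                term_scores[t][c] /= term_weight[t]

        # Step 2: terms -> documents; labeled documents keep their labels.
        for d, terms in enumerate(doc_terms):
            if d in seed_labels or not terms:
                continue
            total = sum(terms.values())
            doc_scores[d] = {
                c: sum(w * term_scores[t][c] for t, w in terms.items()) / total
                for c in classes
            }

    # Ties are broken in favor of the class listed first in `classes`.
    return [max(classes, key=lambda c: doc_scores[d][c]) for d in range(len(docs))]


# Toy collection (hypothetical): two labeled and two unlabeled documents.
docs = [
    ["goal", "match", "team"],      # labeled: sports
    ["election", "vote", "party"],  # labeled: politics
    ["team", "match", "coach"],     # unlabeled
    ["vote", "party", "senate"],    # unlabeled
]
predicted = propagate_labels(docs, {0: "sports", 1: "politics"}, ["sports", "politics"])
print(predicted)  # ['sports', 'politics', 'sports', 'politics']
```

In this sketch the unlabeled documents receive labels only through the terms they share with labeled documents, so no pairwise document similarities are computed. The approach described in the abstract differs in that term relevance scores are set through an optimization process rather than by simple averaging.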