Segmented Document Classification: Problem and Solution

Bibliographic Details
Main authors:	Guo, Hang; Zhou, Lizhu
Format:	Conference proceedings
Language:	English
Subjects:	Applied sciences; Artificial intelligence; Computer science, control theory, systems; Computer systems and distributed systems. User interface; Exact sciences and technology; False Rate; Information systems. Data bases; Memory organisation. Data processing; Plain Text; Semistructured Data; Software; Speech and sound recognition and synthesis. Linguistics; Text Categorization Problem; Training Document
Online access:	Full text

Description
Summary:	In recent years, structured text documents such as XML files have come to play an important role in Web-based applications. Some of these documents are segmented into sections such as “title” and “body”; we call them “segmented documents”. To classify segmented documents, we can treat them as bags of words and use well-developed text classification models. However, different sections of a segmented document may have a different impact on the classification result, so it is better to treat them differently in the classification process. Following this idea, two algorithms, IN_MIX and OUT_MIX, are designed to label segmented documents with a trained classifier. We apply our algorithms with four frequently used models: SVM, Naive Bayes, regression, and instance-based classifiers. In experiments on Reuters-21578, the performance of the different classification models improves compared to the conventional bag-of-words method.
ISSN:	0302-9743 1611-3349
DOI:	10.1007/11827405_53
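
Note: The summary contrasts a flat bag-of-words representation with treating each section of a segmented document differently, but this record does not describe how IN_MIX and OUT_MIX work. The Python sketch below is therefore not an implementation of either algorithm. It only illustrates the general contrast, assuming a hypothetical scheme in which each section's term counts are scaled by a hand-picked weight. The function names, the example document, and the weights are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real system would use a proper tokenizer.
    return text.lower().split()


def flat_bag(doc):
    """Conventional bag of words: section boundaries are ignored."""
    bag = Counter()
    for text in doc.values():
        bag.update(tokenize(text))
    return bag


def weighted_bag(doc, weights):
    """Section-aware variant: each term count is scaled by its section's weight.

    The weighting scheme is a hypothetical stand-in, not IN_MIX or OUT_MIX.
    """
    bag = Counter()
    for section, text in doc.items():
        w = weights.get(section, 1.0)  # unknown sections default to weight 1.0
        for token in tokenize(text):
            bag[token] += w
    return bag


# A toy segmented document with "title" and "body" sections (invented example).
doc = {
    "title": "wheat exports rise",
    "body": "exports of wheat and corn rise in march",
}
weights = {"title": 3.0, "body": 1.0}  # hypothetical section weights

flat = flat_bag(doc)
weighted = weighted_bag(doc, weights)
print(flat["wheat"], weighted["wheat"])
```

In the flat representation, a term counts the same wherever it appears. In the weighted one, a term in the title contributes more than the same term in the body. Either vector could then be passed to any standard classifier.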