Classifying encyclopedia articles: Comparing machine and deep learning methods and exploring their predictions
Published in: | Data & Knowledge Engineering, 2022-11, Vol. 142, p. 102098, Article 102098 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | English |
Abstract: | This article presents a comparative study of supervised classification approaches applied to the automatic classification of encyclopedia articles written in French. Our dataset includes all 70k text articles from Diderot and d’Alembert’s Encyclopédie (1751-72). In a two-task experiment we test combinations of (1) text vectorization methods (bags-of-words and word embeddings) and (2) traditional Machine Learning and newer Deep Learning classification methods (including transformer architectures). In addition to evaluating each approach, we review the results quantitatively and qualitatively. The best model obtains an average F-score of 86% for 38 classes. Using network analysis, we highlight the difficulty of labeling semantically close classes. We also discuss misclassifications in order to understand the relationship between content and different ways of ordering knowledge. We openly release all code and results, and data is available on request (https://gitlab.liris.cnrs.fr/geode/EDdA-Classification). |
---|---|
ISSN: | 0169-023X 1872-6933 |
DOI: | 10.1016/j.datak.2022.102098 |
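
The abstract mentions bag-of-words vectorization and an average F-score over classes. As a rough illustration only, the sketch below shows what these two notions can look like in plain Python. It is not the authors' code: the function names `bag_of_words` and `macro_f_score` are hypothetical, the whitespace tokenizer is a simplification, and the abstract does not say which averaging scheme was used, so unweighted (macro) averaging is an assumption here.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to token counts (lowercased, whitespace-split).

    Assumption: a deliberately naive tokenizer, for illustration only.
    """
    return Counter(text.lower().split())


def macro_f_score(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Assumption: macro averaging; the abstract only says "average F-score".
    """
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With macro averaging every class counts equally, whatever its size, which matters when a label set (such as the 38 classes reported above) is unevenly populated.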