Toward any-language zero-shot topic classification of textual documents

Bibliographic details
Published in:	Artificial Intelligence 2019-09, Vol.274, p.133-150
Main authors:	Song, Yangqiu; Upadhyay, Shyam; Peng, Haoruo; Mayhew, Stephen; Roth, Dan
Format:	Article
Language:	eng
Subjects:	Algorithms; Classification; Cross-lingual text classification; Embedding; Encyclopedias; Language; Languages; Machine translation; Mapping; Multilingual text classification; Semantic Supervision; Semantics; Similarity; Zero-shot text classification

Description
Abstract:	In this paper, we present a zero-shot classification approach to document classification in any language into topics which can be described by English keywords. This is done by embedding both labels and documents into a shared semantic space that allows one to compute meaningful semantic similarity between a document and a potential label. The embedding space can be created by either mapping into a Wikipedia-based semantic representation or learning cross-lingual embeddings. But if the Wikipedia in the target language is small or there is not enough training corpus to train a good embedding space for low-resource languages, then performance can suffer. Thus, for low-resource languages, we further use a word-level dictionary to convert documents into a high-resource language, and then perform classification based on the high-resource language. This approach can be applied to thousands of languages, which can be contrasted with machine translation, which is a supervision-heavy approach feasible for about 100 languages. We also develop a ranking algorithm that makes use of language similarity metrics to automatically select a good pivot or bridging high-resource language, and show that this significantly improves classification of low-resource language documents, performing comparably to the best bridge possible.
ISSN:	0004-3702 1872-7921
DOI:	10.1016/j.artint.2019.02.002
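
Illustrative sketch (not part of the catalog record or the article): the abstract describes placing labels and documents in a shared semantic space, scoring them by similarity, and first converting low-resource documents word by word into a high-resource pivot language. The Python below is a minimal sketch of that general idea only. The three-dimensional vectors, the tokens "xa" and "xb", the dictionary, and the function names embed, cosine, translate_words and classify are all hypothetical placeholders; averaging word vectors and using cosine similarity are assumptions made for the example, not the authors' published method.

```python
import math

# Hypothetical toy vectors in a shared 3-d space (illustration only).
# English label keywords and Spanish words live in the same space.
EMBEDDINGS = {
    "sports": [0.9, 0.1, 0.0],
    "politics": [0.1, 0.9, 0.0],
    "fútbol": [0.8, 0.2, 0.1],
    "gol": [0.9, 0.0, 0.1],
    "elección": [0.1, 0.8, 0.2],
}

# Hypothetical word-level dictionary from a low-resource language
# (placeholder tokens) into the high-resource pivot language (Spanish).
TOY_DICTIONARY = {"xa": "fútbol", "xb": "elección"}


def embed(words):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words to embed")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def translate_words(words, dictionary):
    """Word-by-word conversion; words missing from the dictionary are dropped."""
    return [dictionary[w] for w in words if w in dictionary]


def classify(doc_words, labels):
    """Return the label whose embedding is most similar to the document's."""
    doc_vec = embed(doc_words)
    return max(labels, key=lambda label: cosine(doc_vec, embed(label.split())))


labels = ["sports", "politics"]
direct = classify(["fútbol", "gol"], labels)
bridged = classify(translate_words(["xb", "zz"], TOY_DICTIONARY), labels)
```

No labeled training documents are used here: a label is represented only by the vectors of its own keywords, which is what makes the setup zero-shot. Choosing which pivot language to bridge through, the ranking step mentioned in the abstract, is not sketched.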