Finding next of kin: Cross-lingual embedding spaces for related languages


Bibliographic details
Published in: Natural Language Engineering 2020-03, Vol. 26 (2), pp. 163-182
Author: Sharoff, Serge
Format: Article
Language: English
Description
Abstract: Some languages have very few NLP resources, while many of them are closely related to better-resourced languages. This paper explores how the similarity between the languages can be utilised by porting resources from better- to lesser-resourced languages. The paper introduces a way of building a representation shared across related languages by combining cross-lingual embedding methods with a lexical similarity measure based on the weighted Levenshtein distance. One of the outcomes of the experiments is a Panslavonic embedding space for nine Balto-Slavonic languages. The paper demonstrates that the resulting embedding space helps in applications such as morphological prediction, named-entity recognition and genre classification.
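
The abstract mentions a lexical similarity measure based on the weighted Levenshtein distance but does not spell it out. Below is a minimal, illustrative Python sketch of such a measure, assuming a hand-picked table of cheap substitutions between similar graphemes; the cost values and example word pairs are hypothetical and are not the weights used in the paper.

# Illustrative weighted Levenshtein distance: standard dynamic programming,
# but substitutions between related characters are charged less than 1.0.
# The cost table below is a placeholder, not the learned weights from the paper.
SUB_COST = {
    ("v", "w"): 0.3,   # hypothetical: graphemes that often correspond across related languages
    ("i", "y"): 0.3,
    ("e", "ě"): 0.2,
}

def sub_cost(a: str, b: str) -> float:
    """Cost of substituting character a with b (symmetric lookup, default 1.0)."""
    if a == b:
        return 0.0
    return SUB_COST.get((a, b), SUB_COST.get((b, a), 1.0))

def weighted_levenshtein(s: str, t: str, ins_del: float = 1.0) -> float:
    """Edit distance between s and t with character-dependent substitution costs."""
    m, n = len(s), len(t)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]   # dp[i][j]: distance of s[:i] vs t[:j]
    for i in range(1, m + 1):
        dp[i][0] = i * ins_del
    for j in range(1, n + 1):
        dp[0][j] = j * ins_del
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + ins_del,                              # deletion
                dp[i][j - 1] + ins_del,                              # insertion
                dp[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),     # substitution
            )
    return dp[m][n]

if __name__ == "__main__":
    # A cognate pair across related languages scores much lower than an unrelated pair.
    print(weighted_levenshtein("voda", "woda"))    # 0.3 with the illustrative weights
    print(weighted_levenshtein("voda", "kniha"))   # noticeably larger

Such a distance can be used alongside cross-lingual embeddings to score candidate translation pairs between related languages, favouring likely cognates.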
ISSN: 1351-3249, 1469-8110
DOI: 10.1017/S1351324919000354