data2lang2vec: Data Driven Typological Features Completion
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Language typology databases enhance multi-lingual Natural Language Processing
(NLP) by improving model adaptability to diverse linguistic structures. The
widely used lang2vec toolkit integrates several such databases, but its
coverage remains limited at 28.9%. Previous work on automatically increasing
coverage predicts missing values based on features from other languages or
focuses on single features; we instead propose to use textual data for
better-informed feature prediction. To this end, we introduce a multi-lingual
Part-of-Speech (POS) tagger, achieving over 70% accuracy across 1,749 languages,
and experiment with external statistical features and a variety of machine learning
algorithms. We also introduce a more realistic evaluation setup, focusing on
typology features that are likely to be missing, and show that our approach
outperforms previous work in both setups. |
---|---|
DOI: | 10.48550/arxiv.2409.17373 |
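
The coverage figure quoted in the summary can be read as the share of (language, feature) cells in a typology matrix that hold a known value. The following is a minimal sketch of that calculation under stated assumptions: the `coverage` function and the toy matrix are invented for illustration, missing values are represented as `None`, and nothing here uses the actual lang2vec API or data.

```python
from typing import Optional, Sequence


def coverage(matrix: Sequence[Sequence[Optional[int]]]) -> float:
    """Return the fraction of cells that hold a known value (not None)."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        return 0.0
    known = sum(1 for row in matrix for value in row if value is not None)
    return known / total


# Toy data, invented for illustration: 3 languages x 4 binary features.
# None marks a feature value that is missing for that language.
toy = [
    [1, None, 0, None],
    [None, None, 1, None],
    [0, 1, None, None],
]

print(f"coverage = {coverage(toy):.1%}")
```

In this toy matrix, 5 of the 12 cells are known. Predicting missing typological features amounts to filling the `None` cells, which is the task the summary describes.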