Script-Agnostic Language Identification
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | Language identification is used as the first step in many data collection and
crawling efforts because it allows us to sort online text into
language-specific buckets. However, many modern languages, such as Konkani,
Kashmiri, and Punjabi, are synchronically written in several scripts.
Moreover, languages with different writing systems do not share significant
lexical, semantic, and syntactic properties in neural representation spaces,
which is a disadvantage for closely related languages and low-resource
languages, especially those from the Indian Subcontinent. To counter this, we
propose learning script-agnostic representations using several different
experimental strategies (upscaling, flattening, and script mixing) focusing on
four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find
that word-level script randomization and exposure to a language written in
multiple scripts are extremely valuable for downstream script-agnostic language
identification, while also maintaining competitive performance on naturally
occurring text. |
DOI: | 10.48550/arxiv.2406.17901 |