Context-aware Transliteration of Romanized South Asian Languages
Published in: | Computational linguistics - Association for Computational Linguistics 2024-06, Vol.50 (2), p.475-534 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | While most transliteration research is focused on single tokens such as named
entities (for example, transliteration of a city name from the Gujarati script
to the Latin-script form “Ahmedabad”, the most populous city in the Indian
state of Gujarat), the informal romanization prevalent in South Asia and
elsewhere often requires transliteration of full sentences. The lack of large
parallel text collections of full-sentence (as opposed to single-word)
transliterations necessitates incorporating contextual information into
transliteration via non-parallel resources, such as mono-script text
collections. In this article, we present a number of methods for improving
transliteration in context for such a use scenario. Some of these methods in
fact improve performance without making use of sentential context, allowing for
better quantification of the degree to which contextual information in
particular is responsible for system improvements. Our final systems, which
ultimately rely upon ensembles including large pretrained language models
fine-tuned on simulated parallel data, yield substantial improvements over the
best previously reported results for full-sentence transliteration from Latin to
native script on all 12 languages in the Dakshina dataset (Roark et al.),
with an overall 3.3% absolute (18.6% relative) mean word-error rate reduction. |
---|---|
ISSN: | 0891-2017, 1530-9312 |
DOI: | 10.1162/coli_a_00510 |