Improved Sentiment Detection via Label Transfer from Monolingual to Synthetic Code-Switched Text
Multilingual writers and speakers often alternate between two languages in a single discourse, a practice called "code-switching". Existing sentiment detection methods are usually trained on sentiment-labeled monolingual text. Manually labeled code-switched text, especially involving minor...
Gespeichert in:
Hauptverfasser: | , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Multilingual writers and speakers often alternate between two languages in a
single discourse, a practice called "code-switching". Existing sentiment
detection methods are usually trained on sentiment-labeled monolingual text.
Manually labeled code-switched text, especially involving minority languages,
is extremely rare. Consequently, the best monolingual methods perform
relatively poorly on code-switched text. We present an effective technique for
synthesizing labeled code-switched text from labeled monolingual text, which is
more readily available. The idea is to replace carefully selected subtrees of
constituency parses of sentences in the resource-rich language with suitable
token spans selected from automatic translations to the resource-poor language.
By augmenting scarce human-labeled code-switched text with plentiful synthetic
code-switched text, we achieve significant improvements in sentiment labeling
accuracy (1.5%, 5.11%, 7.20%) for three different language pairs
(English-Hindi, English-Spanish and English-Bengali). We also get significant
gains for hate speech detection: 4% improvement using only synthetic text and
6% if augmented with real text. |
---|---|
DOI: | 10.48550/arxiv.1906.05725 |