A Deep Generative Model for Code-Switched Text
Saved in:

Main authors:
Format: Article
Language: eng
Subjects:
Online access: Order full text
Abstract: Code-switching, the interleaving of two or more languages within a sentence or discourse, is pervasive in multilingual societies. Accurate language models for code-switched text are critical for NLP tasks. State-of-the-art data-intensive neural language models are difficult to train well from scarce language-labeled code-switched text. A potential solution is to use deep generative models to synthesize large volumes of realistic code-switched text. Although generative adversarial networks and variational autoencoders can synthesize plausible monolingual text from a continuous latent space, they cannot adequately address code-switched text, owing to its informal style and the complex interplay between the constituent languages. We introduce VACS, a novel variational autoencoder architecture specifically tailored to code-switching phenomena. VACS encodes to and decodes from a two-level hierarchical representation, which models syntactic contextual signals in the lower level and language-switching signals in the upper level. Sampling representations from the prior and decoding them produces well-formed, diverse code-switched sentences. Extensive experiments show that using synthetic code-switched text together with natural monolingual data results in a significant (33.06%) drop in perplexity.
DOI: 10.48550/arxiv.1906.08972
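
The abstract describes the architecture only at a high level: a VAE whose latent representation is split into an upper level capturing language-switching signals and a lower level, conditioned on it, capturing syntactic context. As a rough, non-authoritative illustration of that idea, the following is a minimal PyTorch sketch of such a two-level hierarchical VAE over token sequences. All module names, dimensions, and the simplified standard-normal priors are assumptions made for illustration; they are not the authors' VACS implementation (see the DOI above for the paper itself).

```python
import torch
import torch.nn as nn

class TwoLevelVAE(nn.Module):
    """Illustrative two-level hierarchical VAE: an upper latent z_lang for
    language-switching signals and a lower latent z_syn, inferred given
    z_lang, for syntactic context. A sketch of the general idea, not VACS."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256,
                 z_lang_dim=16, z_syn_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # Upper level: q(z_lang | x)
        self.lang_mu = nn.Linear(hid_dim, z_lang_dim)
        self.lang_logvar = nn.Linear(hid_dim, z_lang_dim)
        # Lower level: q(z_syn | x, z_lang)
        self.syn_mu = nn.Linear(hid_dim + z_lang_dim, z_syn_dim)
        self.syn_logvar = nn.Linear(hid_dim + z_lang_dim, z_syn_dim)
        # Decoder conditions every timestep on both latents.
        self.decoder = nn.GRU(emb_dim + z_lang_dim + z_syn_dim,
                              hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    @staticmethod
    def reparameterize(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, tokens):
        emb = self.embed(tokens)                     # (B, T, E)
        _, h = self.encoder(emb)                     # (1, B, H)
        h = h.squeeze(0)
        mu_l, lv_l = self.lang_mu(h), self.lang_logvar(h)
        z_lang = self.reparameterize(mu_l, lv_l)
        hl = torch.cat([h, z_lang], dim=-1)
        mu_s, lv_s = self.syn_mu(hl), self.syn_logvar(hl)
        z_syn = self.reparameterize(mu_s, lv_s)
        # Teacher forcing: append both latents to each input embedding.
        z = torch.cat([z_lang, z_syn], dim=-1)
        z_rep = z.unsqueeze(1).expand(-1, emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([emb, z_rep], dim=-1))
        logits = self.out(dec_out)                   # (B, T, V)
        # KL to standard-normal priors; a learned hierarchical prior
        # p(z_syn | z_lang) would be closer to the paper's description.
        kl = -0.5 * torch.sum(1 + lv_l - mu_l**2 - lv_l.exp()) \
             - 0.5 * torch.sum(1 + lv_s - mu_s**2 - lv_s.exp())
        return logits, kl
```

Training would minimize reconstruction cross-entropy on the decoded tokens plus the KL terms; synthetic code-switched text, in the spirit of the abstract, would then be produced by sampling both latents from the prior and decoding.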