Like a Baby: Visually Situated Neural Language Acquisition
Main authors:
Format: Article
Language: eng
Subjects:
Online access: Order full text
Abstract: We examine the benefits of visual context in training neural language models to perform next-word prediction. A multi-modal neural architecture is introduced that outperforms its equivalent trained on language alone, with a 2% decrease in perplexity, even when no visual context is available at test time. Fine-tuning the embeddings of a pre-trained state-of-the-art bidirectional language model (BERT) in the language modeling framework yields a 3.5% improvement. The advantage of training with visual context when testing without it is robust across different languages (English, German, and Spanish) and different models (GRU, LSTM, $\Delta$-RNN, as well as those that use BERT embeddings). Thus, language models perform better when they learn like a baby, i.e., in a multi-modal environment. This finding is compatible with the theory of situated cognition: language is inseparable from its physical context.
DOI: 10.48550/arxiv.1805.11546
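The abstract describes the multi-modal setup only at a high level. The following is a minimal, illustrative PyTorch sketch of the general idea, assuming an LSTM language model that fuses a projected image-feature vector with the word embedding at each time step and substitutes a zero vector when no visual context is available at test time. The class and parameter names (`VisuallyGroundedLM`, `vis_dim`, the 2048-dimensional image features) are assumptions for illustration, not the authors' architecture.

```python
# Illustrative sketch (not the paper's implementation): an LSTM language
# model whose next-word prediction can optionally be conditioned on a
# visual feature vector, e.g., CNN features of the described image.
import torch
import torch.nn as nn


class VisuallyGroundedLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, vis_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Project visual features into the embedding space so they can be
        # fused with word embeddings at every time step.
        self.vis_proj = nn.Linear(vis_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim * 2, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, visual=None):
        # tokens: (batch, seq_len) word indices
        # visual: (batch, vis_dim) image features, or None at test time
        emb = self.embed(tokens)                       # (batch, seq, emb_dim)
        if visual is None:
            # No visual context available: use a zero vector, so the same
            # weights can be evaluated with or without images.
            vis = torch.zeros_like(emb)
        else:
            vis = self.vis_proj(visual).unsqueeze(1).expand_as(emb)
        hidden, _ = self.rnn(torch.cat([emb, vis], dim=-1))
        return self.out(hidden)                        # next-word logits


# Usage: train with image features, then evaluate with or without them,
# mirroring the test-without-vision condition described in the abstract.
model = VisuallyGroundedLM(vocab_size=10000)
tokens = torch.randint(0, 10000, (4, 12))
image_feats = torch.randn(4, 2048)
logits_with_vision = model(tokens, image_feats)
logits_text_only = model(tokens)                       # no visual context
```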