CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations
Format: Article
Language: English
Abstract: Deriving multimodal representations of audio and lexical inputs is a central
problem in Natural Language Understanding (NLU). In this paper, we present
Contrastive Aligned Audio-Language Multirate and Multimodal Representations
(CALM), an approach for learning multimodal representations using contrastive
and multirate information inherent in audio and lexical inputs. The proposed
model aligns acoustic and lexical information in the input embedding space of a
pretrained language-only contextual embedding model. By aligning audio
representations to pretrained language representations and utilizing
contrastive information between acoustic inputs, CALM is able to bootstrap
audio embeddings competitive with existing audio representation models in only a
few hours of training time. Operationally, audio spectrograms are processed
using linearized patches through a Spectral Transformer (SpecTran), which is
trained using a Contrastive Audio-Language Pretraining objective to align audio
and language from similar queries. Subsequently, the derived acoustic and
lexical token representations are input into a multimodal transformer to
incorporate utterance level context and derive the proposed CALM
representations. We show that these pretrained embeddings can subsequently be
used in multimodal supervised tasks and demonstrate the benefits of the
proposed pretraining steps in terms of the alignment of the two embedding
spaces and the multirate nature of the pretraining. Our system shows a 10-25%
improvement over existing emotion recognition systems, including
state-of-the-art three-modality systems under various evaluation objectives.
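The pipeline the abstract describes, a patch-based spectral encoder whose utterance embedding is contrastively aligned with a text embedding, can be illustrated with a short sketch. Below is a minimal PyTorch sketch of such an encoder plus a symmetric InfoNCE objective; the class names, patch sizes, and hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the contrastive audio-language pretraining idea,
# assuming PyTorch. All names and hyperparameters here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpecTranEncoder(nn.Module):
    """Toy Spectral Transformer: linearized spectrogram patches fed to a
    standard transformer encoder; a CLS token summarizes the utterance."""
    def __init__(self, patch=(16, 4), dim=256, depth=4, heads=4):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch[0] * patch[1], dim)   # linearized patch -> token
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, spec):                               # spec: (B, n_mels, frames)
        B, M, T = spec.shape
        pm, pt = self.patch
        spec = spec[:, : M - M % pm, : T - T % pt]         # crop to the patch grid
        patches = spec.unfold(1, pm, pm).unfold(2, pt, pt) # (B, M//pm, T//pt, pm, pt)
        tokens = self.proj(patches.reshape(B, -1, pm * pt))
        tokens = torch.cat([self.cls.expand(B, -1, -1), tokens], dim=1)
        return self.encoder(tokens)[:, 0]                  # CLS embedding, (B, dim)

def contrastive_audio_language_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: paired audio/text rows are positives and every
    other pairing in the batch is a negative."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.t() / temperature                       # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

In CALM itself, per the abstract, the text side comes from the input embedding space of a pretrained language-only contextual embedding model rather than an encoder trained from scratch, and the aligned acoustic and lexical tokens are then fed to a multimodal transformer to incorporate utterance-level context.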
DOI: 10.48550/arxiv.2202.03587