End-to-end speech-to-dialog-act recognition
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Spoken language understanding, which extracts intents and/or semantic
concepts from utterances, is conventionally formulated as a post-processing step
of automatic speech recognition (ASR). It is usually trained with oracle
transcripts, but must cope with ASR errors. Moreover, there are acoustic features
that are related to intents but not represented in the transcripts. In
this paper, we present an end-to-end model that directly converts speech into
dialog acts (DAs) without the deterministic transcription process. In the proposed
model, the DA recognition network is joined to an acoustic-to-word
ASR model at its latent layer before the softmax layer, which provides a
distributed representation of word-level ASR decoding information. Then, the
entire network is fine-tuned in an end-to-end manner. This allows for stable
training as well as robustness against ASR errors. The model is further
extended to conduct DA segmentation jointly. Evaluations on the Switchboard
corpus demonstrate that the proposed method significantly improves dialog act
recognition accuracy over the conventional pipeline framework. |
---|---|
DOI: | 10.48550/arxiv.2004.11419 |
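
The coupling described in the summary, where the DA recognition network reads the ASR model's latent layer rather than its softmax output, can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: the class name `JointModel`, the single tanh layer standing in for the ASR encoder, the mean pooling over word-level steps, and all dimensions are hypothetical, and each input vector is assumed to correspond to one word-level decoding step.

```python
import math
import random


def linear(x, W, b):
    """Compute W x + b for one input vector x (W is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_matrix(rows, cols, rng):
    """Small random weights; a real model would load pretrained ASR weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class JointModel:
    """Hypothetical toy model: an ASR word head and a DA head share one latent layer."""

    def __init__(self, feat_dim, latent_dim, vocab_size, num_das, seed=0):
        rng = random.Random(seed)
        # Stand-in for the ASR encoder (assumption: one tanh layer).
        self.W_enc = init_matrix(latent_dim, feat_dim, rng)
        self.b_enc = [0.0] * latent_dim
        # ASR output layer: latent vector -> word scores (fed to a softmax).
        self.W_word = init_matrix(vocab_size, latent_dim, rng)
        self.b_word = [0.0] * vocab_size
        # DA head attached to the same latent vectors, before the word softmax.
        self.W_da = init_matrix(num_das, latent_dim, rng)
        self.b_da = [0.0] * num_das

    def latent(self, steps):
        """Latent representation for each word-level step."""
        return [[math.tanh(v) for v in linear(s, self.W_enc, self.b_enc)]
                for s in steps]

    def forward(self, steps):
        h = self.latent(steps)
        # Word posteriors, as the ASR model alone would produce them.
        word_probs = [softmax(linear(h_t, self.W_word, self.b_word)) for h_t in h]
        # The DA head never sees a 1-best transcript: it pools the latent
        # vectors (mean pooling is an assumption) and classifies the utterance.
        pooled = [sum(col) / len(h) for col in zip(*h)]
        da_probs = softmax(linear(pooled, self.W_da, self.b_da))
        return word_probs, da_probs
```

Because both heads read the same latent vectors, a DA loss can propagate into the ASR layers during joint fine-tuning, which is the property the summary attributes to the end-to-end setup. Training, the DA segmentation extension, and the actual network sizes are left out of this sketch.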