Response type selection for chat-like spoken dialog systems based on LSTM and multi-task learning

Bibliographic details
Published in: Speech Communication 2021-10, Vol. 133, p. 23-30
Main authors: Ohta, Kengo; Nishimura, Ryota; Kitaoka, Norihide
Format: Article
Language: English
Description
Summary: We propose a method of automatically selecting appropriate responses in conversational spoken dialog systems by first explicitly determining the response type that is needed, based on a comparison of the user’s input utterance with many other utterances. Response utterances are then generated based on this response type designation (back channel, changing the topic, expanding the topic, etc.). This allows the generation of more appropriate responses than conventional end-to-end approaches, which use only the user’s input to directly generate response utterances. As a response type selector, we propose an LSTM-based encoder–decoder framework utilizing acoustic and linguistic features extracted from input utterances. In order to extract these features more accurately, we utilize not only input utterances but also response utterances in the training corpus. To do so, multi-task learning using multiple decoders is also investigated. To evaluate our proposed method, we conducted experiments using a corpus of dialogs between elderly people and an interviewer. Our proposed method outperformed conventional methods using either a point-wise classifier based on Support Vector Machines or a single-task learning LSTM. The best performance was achieved when our two response type selectors (one trained using acoustic features, and the other trained using linguistic features) were combined, and multi-task learning was also performed.

• We propose a response type selector for conversational spoken dialog systems.
• Multi-task learning with multiple decoders is applied.
• Our proposed method achieved better performance than conventional methods.
• Effects of acoustic features and linguistic features are compared.
• The best accuracy was achieved when acoustic and linguistic features were combined.
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2021.07.003