Modeling speech localization, identification, and word recognition in a multi-talker setting
Published in: The Journal of the Acoustical Society of America, 2017-05, Vol. 141 (5), p. 3693
Main authors: , ,
Format: Article
Language: English
Online access: Full text
Abstract: In many everyday situations, listeners are confronted with complex acoustic scenes. Despite this complexity, they are able to follow and understand one particular talker. This contribution presents auditory models that aim to solve speech-related tasks in multi-talker settings. The main characteristics of the models are: (1) restriction to salient auditory features (“glimpses”); (2) use of periodicity, periodic energy, and binaural features; and (3) template-based classification methods using clean speech models. Further classification approaches using state-space models will be discussed. Model performance is evaluated against human psychoacoustic data [e.g., Brungart and Simpson, Perception & Psychophysics, 2007, 69(1), 79-91; Schoenmaker and van de Par, Physiology, Psychoacoustics and Cognition in Normal and Impaired Hearing, 2016, 73-81]. The model results were largely consistent with the human results. This suggests that sparse glimpses of periodicity-related monaural and binaural auditory features provide sufficient information about a complex auditory scene involving multiple talkers. Furthermore, it can be concluded that clean speech models are sufficient to decode speech information from the glimpses derived from a complex scene, i.e., computationally complex models of sound source superposition are not required for decoding a speech stream.
ISSN: 0001-4966, 1520-8524
DOI: 10.1121/1.4988045
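
The abstract's two central ideas, keeping only salient feature "glimpses" and classifying them against templates built from clean speech, can be illustrated with a minimal sketch. The following Python snippet is not the authors' implementation: the feature representation, the saliency measure, the threshold, and all function names are illustrative assumptions standing in for the periodicity-related monaural and binaural features described above.

```python
# Minimal sketch (illustrative only, not the published model):
# (1) keep only "glimpsed" feature frames whose saliency exceeds a threshold,
# (2) classify the glimpses against clean-speech templates by nearest match.
import numpy as np

def extract_glimpses(features, saliency, threshold):
    """Keep only the feature frames whose saliency exceeds the threshold."""
    mask = saliency > threshold
    return features[mask], mask

def classify_with_templates(glimpses, templates):
    """Each glimpsed frame votes for the nearest clean-speech template
    (minimum Euclidean distance over that template's frames)."""
    votes = np.zeros(len(templates), dtype=int)
    for frame in glimpses:
        dists = [np.linalg.norm(t - frame, axis=1).min() for t in templates]
        votes[int(np.argmin(dists))] += 1
    return int(np.argmax(votes)), votes

# Toy demonstration with random stand-ins for clean-speech templates.
rng = np.random.default_rng(0)
templates = [rng.normal(0.0, 1.0, size=(20, 8)),  # template for "word A"
             rng.normal(2.0, 1.0, size=(20, 8))]  # template for "word B"
mixture = rng.normal(2.0, 1.0, size=(50, 8))      # frames dominated by "word B"
saliency = rng.uniform(size=50)                   # stand-in per-frame saliency
glimpses, mask = extract_glimpses(mixture, saliency, threshold=0.7)
label, votes = classify_with_templates(glimpses, templates)
print(f"kept {int(mask.sum())}/50 frames as glimpses, votes={votes}, "
      f"winner: word {'AB'[label]}")
```

The per-frame voting is one simple stand-in for template-based classification; the actual models would operate on periodicity and binaural feature maps rather than random frames, and the discussed state-space approaches would replace the frame-wise nearest-template rule with a temporal model.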