Voicing-, voiceless-, and non-glimpses in speech intelligibility prediction

Bibliographic details
Published in: The Journal of the Acoustical Society of America 2023-03, Vol.153 (3_supplement), p.A172-A172
Main authors: Sun, Yinglun; Tang, Yan
Format: Article
Language: English
Online access: Full text
Description
Summary: The number of speech spectro-temporal (S-T) regions escaping from noise masking, known as “glimpses,” is proportional to speech intelligibility in noise. Previous studies have demonstrated that intelligibility can be estimated by calculating the glimpse proportion (GP). More recent evidence revealed that the contribution of glimpses to intelligibility depends on the energy level of the glimpsed regions, and that even non-glimpsed regions play a non-negligible role in speech perception in noise. This study incorporated voicing-voiceless information into glimpse-based intelligibility estimation. Before computing the GP, the counts of raw glimpsed regions, or of those with energy above the mean noise level, were weighted according to the voicing-voiceless status of the frame in which the glimpses were detected. Evaluated using speech signals processed to have thirteen glimpse compositions in both temporally stationary and fluctuating noise maskers, the linear correlation between model predictions and listeners' word recognition rates increased from 0.76 to 0.80 for weighted GP, and from 0.89 to 0.92 for weighted high-energy GP. Further taking the contribution of non-glimpsed regions into account improved the correlation to 0.95, suggesting that intelligibility in noise can be better predicted when the contributions of different speech regions are finely modelled.
ISSN: 0001-4966, 1520-8524
DOI: 10.1121/10.0018560
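
The frame-level weighting described in the summary can be illustrated with a minimal Python sketch. It is not the authors' model. The record does not state the glimpse criterion, the weight values, or the normalisation, so the following are all assumptions made for illustration: the function name weighted_glimpse_proportion, the 3 dB local-SNR threshold, the parameters w_voiced and w_voiceless, and the choice to normalise by the weighted number of regions. The high-energy variant and the non-glimpsed contribution are not covered.

```python
from typing import Sequence


def weighted_glimpse_proportion(
    speech_db: Sequence[Sequence[float]],
    noise_db: Sequence[Sequence[float]],
    voiced: Sequence[bool],
    threshold_db: float = 3.0,   # assumed local-SNR criterion, not from the record
    w_voiced: float = 1.0,       # hypothetical weight for voiced frames
    w_voiceless: float = 1.0,    # hypothetical weight for voiceless frames
) -> float:
    """Weighted proportion of spectro-temporal regions counted as glimpses.

    speech_db and noise_db hold one list of per-channel levels (dB) per frame;
    voiced holds one flag per frame. A region counts as a glimpse when the
    speech level exceeds the noise level by more than threshold_db.
    """
    weighted_glimpses = 0.0
    weighted_total = 0.0
    for s_frame, n_frame, is_voiced in zip(speech_db, noise_db, voiced):
        weight = w_voiced if is_voiced else w_voiceless
        # Count the regions in this frame whose local SNR exceeds the threshold.
        glimpses = sum(1 for s, n in zip(s_frame, n_frame) if s - n > threshold_db)
        weighted_glimpses += weight * glimpses
        # Assumed normalisation: weight the region count too, so the result stays in [0, 1].
        weighted_total += weight * len(s_frame)
    return weighted_glimpses / weighted_total if weighted_total else 0.0


# Toy input: two frames, two frequency channels each (levels in dB).
speech = [[10.0, 10.0], [10.0, 10.0]]
noise = [[0.0, 0.0], [0.0, 20.0]]
voicing = [True, False]

plain_gp = weighted_glimpse_proportion(speech, noise, voicing)
voiced_heavy_gp = weighted_glimpse_proportion(speech, noise, voicing, w_voiced=2.0)
```

With equal weights the function reduces to the plain glimpse proportion. Raising w_voiced makes glimpses in voiced frames count for more in the overall score.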