Efficient voice activity detection algorithms using long-term speech information

Currently, there are technology barriers inhibiting speech processing systems working under extreme noisy conditions. The emerging applications of speech technology, especially in the fields of wireless communications, digital hearing aids or speech recognition, are examples of such systems and ofte...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Speech communication 2004-04, Vol.42 (3), p.271-287
Hauptverfasser: Ramı́rez, Javier, Segura, José C, Benı́tez, Carmen, de la Torre, Ángel, Rubio, Antonio
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Currently, there are technology barriers inhibiting speech processing systems working under extreme noisy conditions. The emerging applications of speech technology, especially in the fields of wireless communications, digital hearing aids or speech recognition, are examples of such systems and often require a noise reduction technique operating in combination with a precise voice activity detector (VAD). This paper presents a new VAD algorithm for improving speech detection robustness in noisy environments and the performance of speech recognition systems. The algorithm measures the long-term spectral divergence (LTSD) between speech and noise and formulates the speech/non-speech decision rule by comparing the long-term spectral envelope to the average noise spectrum, thus yielding a high discriminating decision rule and minimizing the average number of decision errors. The decision threshold is adapted to the measured noise energy while a controlled hang-over is activated only when the observed signal-to-noise ratio is low. It is shown by conducting an analysis of the speech/non-speech LTSD distributions that using long-term information about speech signals is beneficial for VAD. The proposed algorithm is compared to the most commonly used VADs in the field, in terms of speech/non-speech discrimination and in terms of recognition performance when the VAD is used for an automatic speech recognition system. Experimental results demonstrate a sustained advantage over standard VADs such as G.729 and adaptive multi-rate (AMR) which were used as a reference, and over the VADs of the advanced front-end for distributed speech recognition.
ISSN:0167-6393
1872-7182
DOI:10.1016/j.specom.2003.10.002