Error-Diffusion Based Speech Feature Quantization for Small-Footprint Keyword Spotting

Bibliographic Details
Published in: IEEE Signal Processing Letters, 2022, Vol. 29, pp. 1357-1361
Main authors: Luo, Mengjie; Wang, Dingyi; Wang, Xiaoqin; Qiao, Shushan; Zhou, Yumei
Format: Article
Language: English
Online access: Full text
Description
Abstract: A neural network based keyword spotting (KWS) system is a critical component for user interaction in current smart devices. Although small-footprint networks have been widely explored to reduce deployment overhead, low-precision input feature representation still lacks in-depth research. In this letter, an error-diffusion based speech feature quantization method is proposed. Specifically, our algorithm adapts error diffusion from image processing to quantize the input speech feature maps to an arbitrary number of bits. Experiments show that in the 10-keyword KWS task, our 3-bit representation brings only a 0.45% average accuracy drop compared to the full-precision log-Mel spectrograms, while other methods drop by over 3%. In the 2-keyword task, our 3-bit representation produces no significant differences, while 1-bit quantization leads to only a 1.7% average accuracy drop and is even capable of handling similar keywords and imbalanced data distributions. These results show that our method is, to the best of our knowledge, the first practical method that supports quantization as low as 1 bit for single-channel speech features in small-footprint KWS. In addition, we analyze the impact of error-diffusion directions and conclude that time-direction diffusion is more suitable for temporal convolutional networks.
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2022.3179208
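
The abstract only sketches the core idea: error diffusion, a technique known from image dithering, is carried over to log-Mel feature maps so that the residual of each quantized value is pushed into neighboring values, and time-direction diffusion is reported to suit temporal convolutional networks. As a rough illustrative sketch of that idea, and not the authors' exact algorithm, the Python snippet below quantizes a [0, 1]-normalized log-Mel spectrogram to an arbitrary bit width while carrying the full residual to the next frame of the same Mel band; the function name, the normalization assumption, and the one-tap diffusion kernel are assumptions made for illustration.

    # Illustrative sketch only: 1-D error-diffusion quantization along the
    # time axis of a log-Mel spectrogram. The [0, 1] normalization and the
    # kernel (all residual carried to the next frame) are assumptions, not
    # the authors' exact design.
    import numpy as np

    def error_diffusion_quantize(feat, n_bits=3, time_axis=True):
        """Quantize a (n_mels, n_frames) feature map to n_bits,
        diffusing the quantization error along the chosen axis."""
        levels = 2 ** n_bits - 1                 # e.g. 7 steps for 3-bit output
        x = np.asarray(feat, dtype=np.float64)
        if not time_axis:                        # diffuse along frequency instead
            x = x.T
        out = np.empty_like(x)
        for band in range(x.shape[0]):           # each Mel band independently
            err = 0.0
            for t in range(x.shape[1]):
                v = x[band, t] + err             # add the carried-over residual
                q = np.clip(np.round(v * levels) / levels, 0.0, 1.0)
                out[band, t] = q
                err = v - q                      # push residual to the next frame
        return out if time_axis else out.T

    # Example: 1-bit quantization of a random "spectrogram" in [0, 1]
    rng = np.random.default_rng(0)
    spec = rng.random((40, 100))                 # 40 Mel bands, 100 frames
    q1 = error_diffusion_quantize(spec, n_bits=1)
    print(np.unique(q1))                         # values restricted to {0.0, 1.0}

With n_bits=1 the output is restricted to two levels, matching the binary-feature regime discussed in the abstract; a fuller implementation might spread the residual over several neighbors, as classic Floyd-Steinberg dithering does in image processing.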