Bridging the Granularity Gap for Acoustic Modeling

Bibliographic details
Published in:	arXiv.org 2023-05
Main authors:	Chen, Xu; Zhang, Yuhao; Jiao, Chengbo; Liu, Xiaoqian; Hu, Chi; Zeng, Xin; Tong, Xiao; Ma, Anxiang; Wang, Huizhen; Zhu, JingBo
Format:	Article
Language:	eng
Subjects:	Modelling; Representations; Speech recognition
Online access:	Full text

Description
Abstract:	While the Transformer has become the de facto standard for speech, modeling upon the fine-grained frame-level features remains an open challenge of capturing long-distance dependencies and distributing the attention weights. We propose Progressive Down-Sampling (PDS), which gradually compresses the acoustic features into coarser-grained units containing more complete semantic information, like text-level representation. In addition, we develop a representation fusion method to alleviate information loss that occurs inevitably during high compression. In this way, we compress the acoustic features into 1/32 of the initial length while achieving better or comparable performance on the speech recognition task. As a bonus, it yields inference speedups ranging from 1.20× to 1.47×. By reducing the modeling burden, we also achieve competitive results when training on the more challenging speech translation task.
ISSN:	2331-8422