Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | End-to-end (E2E) speech recognition methods have achieved
promising performance. However, automatic speech recognition (ASR) models still
struggle to recognize multi-accent speech accurately. We propose a
layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require
any prior knowledge about the target accent. Based on a dynamic chunk strategy,
our approach enables streaming decoding and can extract frame-level acoustic
features, facilitating fine-grained information fusion. Experimental results
demonstrate that our proposed methods outperform the baseline, with relative
reductions of 22.1% and 17.2% in character error rate (CER) on the
multi-accent test sets of KeSpeech and MagicData-RMAC. |
---|---|
DOI: | 10.48550/arxiv.2407.03026 |
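
The summary reports its gains as relative reductions in character error rate. As a reading aid, the sketch below shows how such a figure is commonly computed, assuming the usual definitions: CER as character-level edit distance divided by reference length, and relative reduction as (baseline - proposed) / baseline. The record does not say how the authors computed their numbers, and the function names and toy strings here are hypothetical illustrations, not data from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(refs, hyps) -> float:
    """Corpus-level CER: total character edits over total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction of the baseline error rate removed by the proposed system."""
    return (baseline - proposed) / baseline


# Toy transcripts (hypothetical, not from KeSpeech or MagicData-RMAC).
refs = ["abcdef", "ghij"]
baseline_hyps = ["abxdef", "ghj"]   # one substitution, one deletion
proposed_hyps = ["abcdef", "ghj"]   # one deletion

baseline_cer = cer(refs, baseline_hyps)
proposed_cer = cer(refs, proposed_hyps)
reduction = relative_reduction(baseline_cer, proposed_cer)
```

Under this definition a relative reduction does not fix the absolute CER values: a reported 22.1% relative reduction only means the proposed system's CER is 77.9% of the baseline's.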