M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper
Format: Article
Language: English
Online access: Order full text
Abstract: State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-Whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, while integrating token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By synergistically combining sentence-level and token-level retrieval strategies, M2R-Whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.
DOI: 10.48550/arxiv.2409.11889
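The token-level kNN post-processing mentioned in the abstract follows the general kNN-LM recipe: retrieve the k cached decoder states nearest to the current query, turn their distances into a probability mass over the tokens they predicted, and interpolate that with the model's own output distribution. A minimal numpy sketch of this interpolation step (the datastore layout, Euclidean distance, and the hyperparameters `k`, `lam`, and `temperature` are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def knn_interpolate(model_probs, datastore_keys, datastore_tokens, query,
                    k=4, lam=0.5, temperature=1.0):
    """Refine a model's next-token distribution with token-level kNN retrieval.

    model_probs      : (vocab,) probability distribution from the ASR decoder
    datastore_keys   : (n, d) cached decoder hidden states
    datastore_tokens : (n,) target-token id stored with each key
    query            : (d,) current decoder hidden state
    """
    # Distance from the query to every cached key (Euclidean, as an assumption).
    dists = np.linalg.norm(datastore_keys - query, axis=1)
    nn = np.argsort(dists)[:k]

    # Softmax over negative distances of the k nearest neighbours.
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()

    # Scatter neighbour weights onto the tokens they predicted.
    knn_probs = np.zeros_like(model_probs)
    for w, tok in zip(weights, datastore_tokens[nn]):
        knn_probs[tok] += w

    # Interpolate the retrieved distribution with the model's own.
    return lam * knn_probs + (1.0 - lam) * model_probs
```

In this sketch `lam` controls how strongly retrieval can override the decoder; the output remains a valid distribution because both inputs sum to one, so even a small datastore can correct a token the model mis-ranks without any parameter updates.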