LA-RAG:Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation
Recent advancements in integrating speech information into large language models (LLMs) have significantly improved automatic speech recognition (ASR) accuracy. However, existing methods often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents. T...
Gespeichert in:
Hauptverfasser: | , , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Recent advancements in integrating speech information into large language
models (LLMs) have significantly improved automatic speech recognition (ASR)
accuracy. However, existing methods often constrained by the capabilities of
the speech encoders under varied acoustic conditions, such as accents. To
address this, we propose LA-RAG, a novel Retrieval-Augmented Generation (RAG)
paradigm for LLM-based ASR. LA-RAG leverages fine-grained token-level speech
datastores and a speech-to-speech retrieval mechanism to enhance ASR accuracy
via LLM in-context learning (ICL) capabilities. Experiments on Mandarin and
various Chinese dialect datasets demonstrate significant improvements in ASR
accuracy compared to existing methods, validating the effectiveness of our
approach, especially in handling accent variations. |
---|---|
DOI: | 10.48550/arxiv.2409.08597 |