SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Speaker Diarization (SD) is a crucial component of modern end-to-end ASR
pipelines. Traditional SD systems, which are typically audio-based and operate
independently of ASR, often introduce speaker errors, particularly during
speaker transitions and overlapping speech. Recently, language models, including
fine-tuned large language models (LLMs), have been shown to be effective as
second-pass speaker error correctors by leveraging lexical context in the
transcribed output. In this work, we introduce a novel acoustic conditioning
approach to provide more fine-grained information from the acoustic diarizer to
the LLM. We also show that a simpler constrained decoding strategy reduces LLM
hallucinations while avoiding complicated post-processing. Our approach
significantly reduces the speaker error rates by 24-43% across the Fisher,
Callhome, and RT03-CTS datasets, compared to the first-pass Acoustic SD. |
---|---|
DOI: | 10.48550/arxiv.2501.08421 |