TOGGL: Transcribing Overlapping Speech with Staggered Labeling
Main Authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | Transcribing the speech of multiple overlapping speakers typically requires
separating the audio into multiple streams and recognizing each one
independently. More recent work jointly separates and transcribes, but requires
a separate decoding component for each speaker. We propose the TOGGL model to
simultaneously transcribe the speech of multiple speakers. The TOGGL model uses
special output tokens to attribute the speech to each speaker with only a
single decoder. Our approach generalizes beyond two speakers, even when trained
only on two-speaker data. We demonstrate superior performance compared to
competing approaches on a conversational speech dataset. Our approach also
improves performance on single-speaker audio. |
DOI: | 10.48550/arxiv.2408.06474 |