Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online Access: | Order full text |
Abstract: Speech-to-Text Translation (S2TT) has typically been addressed with cascade systems, where speech recognition systems generate a transcription that is subsequently passed to a translation model. While there has been growing interest in developing direct speech translation systems to avoid propagating errors and losing non-verbal content, prior work in direct S2TT has struggled to conclusively establish the advantages of integrating the acoustic signal directly into the translation process. This work proposes using contrastive evaluation to quantitatively measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role. Specifically, we evaluated Korean-English translation systems on a test set containing wh-phrases, for which prosodic features are necessary to produce translations with the correct intent, whether it is a statement, a yes/no question, a wh-question, or another intent type. Our results clearly demonstrate the value of direct translation systems over cascade translation models, with a notable 12.9% improvement in overall accuracy in ambiguous cases, along with up to a 15.6% increase in F1 scores for one of the major intent categories. To the best of our knowledge, this work stands as the first to provide quantitative evidence that direct S2TT models can effectively leverage prosody. The code for our evaluation is openly accessible and freely available for review and utilisation.
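
The evaluation described in the abstract scores each system by whether its translation conveys the correct intent, reporting overall accuracy on ambiguous cases and per-category F1. The sketch below is only a rough illustration of how such metrics are computed from gold and predicted intent labels; it is not the authors' released evaluation code, and the intent label set, function names, and example data are assumptions made for illustration.

```python
# Minimal sketch: overall accuracy and per-intent F1 over intent labels.
# The label set below is assumed, not taken from the paper's released code.
INTENTS = ["statement", "yes_no_question", "wh_question"]

def accuracy(gold, pred):
    # Fraction of utterances whose translation carries the correct intent.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_per_intent(gold, pred):
    # Standard precision/recall/F1 computed separately for each intent class.
    scores = {}
    for intent in INTENTS:
        tp = sum(1 for g, p in zip(gold, pred) if g == intent and p == intent)
        fp = sum(1 for g, p in zip(gold, pred) if g != intent and p == intent)
        fn = sum(1 for g, p in zip(gold, pred) if g == intent and p != intent)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        scores[intent] = (2 * precision * recall / (precision + recall)
                          if (precision + recall) else 0.0)
    return scores

if __name__ == "__main__":
    # Toy gold/predicted intent labels, purely for demonstration.
    gold = ["statement", "wh_question", "yes_no_question", "wh_question"]
    pred = ["statement", "yes_no_question", "yes_no_question", "wh_question"]
    print("accuracy:", accuracy(gold, pred))
    print("per-intent F1:", f1_per_intent(gold, pred))
```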
DOI: 10.48550/arxiv.2402.00632