Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Format: Article
Language: English
Abstract: Sign language translation (SLT) is a challenging task that involves
translating sign language images into spoken language. For SLT models to
perform this task successfully, they must bridge the modality gap and identify
subtle variations in sign language components to understand their meanings
accurately. To address these challenges, we propose a novel gloss-free SLT
framework called Multimodal Sign Language Translation (MMSLT), which leverages
the representational capabilities of off-the-shelf multimodal large language
models (MLLMs). Specifically, we generate detailed textual descriptions of sign
language components using MLLMs. Then, through our proposed multimodal-language
pre-training module, we integrate these description features with sign video
features to align them within the spoken sentence space. Our approach achieves
state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily,
highlighting the potential of MLLMs to be effectively utilized in SLT.
DOI: 10.48550/arxiv.2411.16789
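The abstract describes a two-stage, gloss-free pipeline: an off-the-shelf MLLM first produces textual descriptions of sign language components, and a multimodal-language pre-training module then fuses these description features with sign video features and aligns the result with the spoken-sentence space. The PyTorch sketch below illustrates one plausible shape for such a module. It assumes the MLLM descriptions are generated offline and encoded by a frozen text encoder; the feature dimensions, the transformer-based fusion, and the symmetric InfoNCE alignment loss are all illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of an MMSLT-style pre-training module (hypothetical).
# Module names, dimensions, and the alignment objective are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalLanguagePretraining(nn.Module):
    """Fuses per-frame sign-video features with MLLM-generated description
    features, then projects both the fused representation and the spoken
    sentence into a shared embedding space for alignment."""

    def __init__(self, video_dim=1024, text_dim=768, joint_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)  # sign-video features
        self.desc_proj = nn.Linear(text_dim, joint_dim)    # MLLM description features
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=joint_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )
        self.sent_proj = nn.Linear(text_dim, joint_dim)    # spoken-sentence features

    def forward(self, video_feats, desc_feats, sent_feats):
        # video_feats: (B, T, video_dim), desc_feats: (B, T, text_dim),
        # sent_feats:  (B, text_dim) from a frozen sentence encoder.
        fused = self.fuse(self.video_proj(video_feats) + self.desc_proj(desc_feats))
        fused = fused.mean(dim=1)  # pool over the temporal axis
        return (F.normalize(fused, dim=-1),
                F.normalize(self.sent_proj(sent_feats), dim=-1))


def contrastive_alignment_loss(sign_emb, sent_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling each fused sign representation toward
    its paired spoken sentence (one plausible alignment objective)."""
    logits = sign_emb @ sent_emb.t() / temperature
    targets = torch.arange(sign_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, T = 4, 16
    model = MultimodalLanguagePretraining()
    sign, sent = model(torch.randn(B, T, 1024),  # visual encoder output
                       torch.randn(B, T, 768),   # encoded MLLM descriptions
                       torch.randn(B, 768))      # encoded spoken sentences
    print(contrastive_alignment_loss(sign, sent).item())
```

Summing the two projected streams before a shared transformer is only one fusion choice; cross-attention between the video and description streams would be an equally reasonable reading of the abstract.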