Multi-View Multi-Task Modeling with Speech Foundation Models for Speech Forensic Tasks
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Speech forensic tasks (SFTs), such as automatic speaker recognition (ASR),
speech emotion recognition (SER), gender recognition (GR), and age estimation
(AE), find use in different security and biometric applications. Previous works
have applied various techniques, with recent studies focusing on applying
speech foundation models (SFMs) for improved performance. However, most prior
efforts have centered on building individual models for each task separately,
despite the inherent similarities among these tasks. This isolated approach
results in higher computational resource requirements, increased costs, time
consumption, and maintenance challenges. In this study, we address these
challenges by employing a multi-task learning strategy. First, we explore
various state-of-the-art (SOTA) SFMs by extracting their representations for
learning these SFTs and investigating their effectiveness on each task
individually. Second, we analyze the performance of the extracted
representations on the SFTs in a multi-task learning framework. We observe a
decline in performance when SFTs are modeled together compared to individual
task-specific models, and as a remedy, we propose multi-view learning (MVL).
Views are representations from different SFMs transformed into distinct
abstract spaces by characteristics unique to each SFM. By leveraging MVL, we
integrate these diverse representations to capture complementary information
across tasks, enhancing the shared learning process. We introduce a new
framework called TANGO (Task Alignment with iNter-view Gated Optimal transport)
to implement this approach. With TANGO, we achieve the best performance in
comparison to individual SFM representations as well as baseline fusion
techniques across benchmark datasets such as CREMA-D, emo-DB, and BAVED. |
---|---|
DOI: | 10.48550/arxiv.2410.12947 |