Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution
Format: Article
Language: English
Abstract: In this work, we investigate various state-of-the-art (SOTA) speech pre-trained models (PTMs) for their capability to capture prosodic signatures of the generative sources for audio deepfake source attribution (ADSD). These prosodic characteristics can be considered one of the major signatures for ADSD, as they are unique to each source: the better a PTM captures prosodic cues, the better the ADSD performance. For our experiments, we consider various SOTA PTMs that have shown top performance on different prosodic tasks, evaluated on the benchmark datasets ASVspoof 2019 and CFAD. x-vector (a speaker recognition PTM) attains the highest performance among all the PTMs considered, despite having the fewest model parameters. This superior performance can be attributed to its speaker-recognition pre-training, which enables it to better capture the unique prosodic characteristics of each source. Further, motivated by tasks such as audio deepfake detection and speech recognition, where fusion of PTM representations leads to improved performance, we explore the same and propose FINDER for effective fusion of such representations. By fusing Whisper and x-vector representations through FINDER, we achieve the best performance compared to all individual PTMs as well as baseline fusion techniques, attaining SOTA performance.
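The abstract does not detail FINDER's internal architecture, but the pipeline it describes (extract utterance-level embeddings from two PTMs, fuse them, classify the generative source) can be sketched. Below is a minimal PyTorch illustration assuming a simple concatenation-plus-projection fusion; the module name `RepresentationFusion`, all embedding dimensions, and the number of sources are hypothetical assumptions, not the paper's actual FINDER design.

```python
import torch
import torch.nn as nn

class RepresentationFusion(nn.Module):
    """Illustrative fusion of two PTM embeddings (e.g., Whisper and x-vector).

    NOTE: This is NOT the paper's FINDER module, whose internals are not
    described in the abstract; it is a generic concatenate-and-project
    fusion sketch. All dimensions and the source count are hypothetical.
    """

    def __init__(self, dim_a: int = 512, dim_b: int = 512,
                 hidden: int = 256, num_sources: int = 7):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, hidden)   # projects, e.g., a Whisper embedding
        self.proj_b = nn.Linear(dim_b, hidden)   # projects, e.g., an x-vector embedding
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, num_sources),  # one logit per generative source
        )

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        # Project each PTM representation into a shared space, concatenate, classify.
        fused = torch.cat([self.proj_a(emb_a), self.proj_b(emb_b)], dim=-1)
        return self.classifier(fused)

# Usage: random tensors stand in for utterance-level PTM embeddings.
model = RepresentationFusion()
whisper_emb = torch.randn(8, 512)   # batch of 8 hypothetical Whisper embeddings
xvector_emb = torch.randn(8, 512)   # batch of 8 hypothetical x-vector embeddings
logits = model(whisper_emb, xvector_emb)  # shape (8, num_sources): attribution logits
```

In the paper's setting, the two inputs would be the Whisper and x-vector representations of the same utterance; a dedicated fusion mechanism such as FINDER would replace the plain concatenation used here.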
DOI: 10.48550/arxiv.2412.17796