One Model to Rule Them All: A Universal Transformer for Biometric Matching
Published in: IEEE Access, 2024, Vol. 12, p. 96729-96739
Main authors: , , , , ,
Format: Article
Language: English
Online access: Full text
Abstract: This study introduces the first single-branch network designed to tackle a spectrum of biometric matching scenarios, including unimodal, multimodal, cross-modal, and missing-modality settings. Our method adapts the prototypical network loss to train concurrently on audio, visual, and thermal data within a unified multimodal framework. By converting all three data types into image format, we employ the Vision Transformer (ViT) architecture with shared model parameters, enabling the encoder to map all input modalities into a unified vector space. The multimodal prototypical network loss ensures that vector representations of the same speaker are close regardless of their original modalities. Evaluation on the SpeakingFaces and VoxCeleb datasets covers a wide range of scenarios and demonstrates the effectiveness of our approach. The trimodal model achieves an Equal Error Rate (EER) of 0.27% on the SpeakingFaces test split, surpassing all previously reported results. Moreover, with a single training run, it achieves performance comparable to that of its unimodal and bimodal counterparts, including unimodal audio, visual, and thermal models, as well as audio-visual, audio-thermal, and visual-thermal configurations. In cross-modal evaluation on the VoxCeleb1 test set (audio versus visual), our approach yields an EER of 24.1%, again outperforming state-of-the-art models. This underscores the effectiveness of our unified model in addressing diverse biometric verification scenarios.
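The training objective described in the abstract can be made concrete with a short sketch. The snippet below is a minimal, illustrative implementation of a multimodal prototypical loss, assuming all modalities have already been rendered as images and embedded by one shared ViT encoder; the function name, tensor shapes, and toy data are our own assumptions, not the authors' code.

```python
# Minimal sketch of a multimodal prototypical loss (illustrative only).
# All support embeddings of a speaker, whatever their source modality,
# are pooled into one prototype; each query embedding is then classified
# by negative squared distance to the prototypes.
import torch
import torch.nn.functional as F

def multimodal_prototypical_loss(support, query, query_labels):
    """
    support: (n_speakers, n_support, dim) -- support embeddings per speaker,
             drawn from any mix of audio / visual / thermal inputs, since the
             shared encoder maps every modality into one vector space.
    query:   (n_query, dim) -- query embeddings, any modality.
    query_labels: (n_query,) -- speaker index of each query.
    """
    # One prototype per speaker: the mean of its support embeddings,
    # regardless of which modality each embedding came from.
    prototypes = support.mean(dim=1)             # (n_speakers, dim)

    # Negative squared Euclidean distances serve as classification logits.
    dists = torch.cdist(query, prototypes) ** 2  # (n_query, n_speakers)
    return F.cross_entropy(-dists, query_labels)

# Toy usage with random tensors standing in for ViT embeddings.
if __name__ == "__main__":
    n_speakers, n_support, n_query_per, dim = 4, 3, 2, 16
    support = torch.randn(n_speakers, n_support, dim)
    query = torch.randn(n_speakers * n_query_per, dim)
    labels = torch.arange(n_speakers).repeat_interleave(n_query_per)
    print(multimodal_prototypical_loss(support, query, labels))
```

Because the prototype averages over modalities, minimizing this loss pulls audio, visual, and thermal embeddings of the same speaker together, which is what lets a single trained model serve the unimodal, bimodal, cross-modal, and missing-modality cases the abstract enumerates.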
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3426602
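For readers unfamiliar with the reported metric, the sketch below shows one common way to estimate the Equal Error Rate from verification scores: the operating point where the false acceptance rate equals the false rejection rate. The threshold sweep and toy data are illustrative assumptions, not the paper's evaluation code.

```python
# Illustrative EER estimation: sweep thresholds and find where the
# false acceptance rate (impostors accepted) meets the false rejection
# rate (genuine pairs rejected).
import numpy as np

def compute_eer(scores, labels):
    """scores: higher = more likely same speaker; labels: 1 same, 0 different."""
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # impostors accepted
        frr = np.mean(scores[labels == 1] < t)   # genuine pairs rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example; an EER of 0.27% would mean FAR and FRR cross at 0.0027.
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0, 0])
print(compute_eer(scores, labels))
```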