Enhanced video clustering using multiple Riemannian manifold-valued descriptors and audio-visual information



Bibliographic details
Published in:	Expert systems with applications 2024-07, Vol.246, p.123099, Article 123099
Main authors:	Hu, Wenbo; Zhan, Hongjian; Tian, Yinghong; Xiong, Yujie; Lu, Yue
Format:	Article
Language:	English
Subjects:	Audio-visual; Riemannian manifolds; Subspace clustering
Online access:	Full text

Description
Summary:	Videos inherently blend multiple modalities in real-world scenarios, primarily visual and auditory cues. When combined, these cues yield richer data representations. Standard clustering techniques, designed primarily for vectorial data in Euclidean spaces, struggle to handle multidimensional data with nonlinear manifold structures, such as video or image sets. While recent subspace clustering methods using Riemannian manifold representations tackle this issue, they often sideline auditory information, overlooking the complementarity of the visual and auditory modalities. This paper presents an approach that constructs multiple Riemannian manifold-valued descriptors to bridge this gap, encapsulating multimodal video information in a unified structure. We design a single-modality Riemannian subspace clustering method for individual modalities and extend it to a multi-modality framework that leverages the interplay of audio-visual data. Detailed optimization and convergence analysis are also provided. The proposed approach significantly outperforms existing state-of-the-art methods, improving accuracy by 4%, 1%, and 2% on the UCF-101, UCF-sport, and AVE datasets, respectively.
• Audio-visual integration boosts video clustering.
• Riemannian descriptors ensure precise video representation.
• Multiple Riemannian descriptors unify video’s multimodal information.
• Experiments reveal improved video clustering performance.
ISSN:	0957-4174 1873-6793
DOI:	10.1016/j.eswa.2023.123099