Multimodal fusion for audio-image and video action recognition
Saved in:
Published in: | Neural Computing & Applications, 2024-04, Vol. 36 (10), p. 5499-5513 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Multimodal Human Action Recognition (MHAR) is an important research topic in the fields of computer vision and event recognition. In this work, we address the problem of MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information using image representations of audio signals and spatial information from the video modality with the help of Convolutional Neural Networks (CNN)-based feature extractors, and fuse these features to recognize the respective action classes. We apply a high-level weights assignment algorithm to improve audio-visual interaction and convergence. The proposed fusion-based framework exploits the influence of the audio and video feature maps and uses them to classify an action. Compared with state-of-the-art audio-visual MHAR techniques, the proposed approach features a simpler yet more accurate and more generalizable architecture, one that performs better with different audio-image representations. The system achieves accuracies of 87.9% and 79.0% on the UCF51 and Kinetics Sounds datasets, respectively. All code and models for this paper will be available at https://tinyurl.com/4ps2ux6n. |
---|---|
ISSN: | 0941-0643 (print); 1433-3058 (electronic) |
DOI: | 10.1007/s00521-023-09186-5 |
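
The fusion step described in the abstract, CNN features from audio-image representations and from video combined under learnable high-level modality weights before classification, can be illustrated with a minimal PyTorch sketch. This is a hypothetical reading, not the authors' released MAiVAR code: the feature dimensions, the shared embedding size, the softmax-normalized scalar weights, and the class count (51, matching UCF51) are all assumptions.

```python
import torch
import torch.nn as nn

class WeightedAudioVideoFusion(nn.Module):
    """Hypothetical late-fusion head: combines audio-image and video
    CNN features using learnable, softmax-normalized modality weights."""

    def __init__(self, audio_dim: int = 512, video_dim: int = 512,
                 embed_dim: int = 256, num_classes: int = 51):
        super().__init__()
        # One learnable scalar per modality; softmax keeps the weights
        # positive and summing to 1 (assumed form of the paper's
        # "high-level weights assignment").
        self.modality_logits = nn.Parameter(torch.zeros(2))
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, audio_feat: torch.Tensor,
                video_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, audio_dim), e.g. pooled CNN features of a
        # spectrogram-style audio image; video_feat: (B, video_dim)
        # from a video CNN backbone.
        w = torch.softmax(self.modality_logits, dim=0)
        fused = (w[0] * self.audio_proj(audio_feat)
                 + w[1] * self.video_proj(video_feat))
        return self.classifier(fused)

# Quick smoke test with random features for a batch of 4 clips.
model = WeightedAudioVideoFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 51])
```

Normalizing the two weights with a softmax keeps the fused representation a convex combination of the modality embeddings, which lets the model learn how much each modality should influence the final class scores; whether the published method uses exactly this parameterization is not stated in the abstract.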