Multimodal deep representation learning for video classification
Published in: World Wide Web (Bussum), 2019-05, Vol. 22 (3), pp. 1325-1341
Main authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Real-world applications usually encounter data in multiple modalities, each containing valuable information. To enhance these applications, it is essential to analyze all of the information extracted from the different data modalities, yet most existing learning models ignore some data types and focus on a single modality. This paper presents a new multimodal deep learning framework for event detection from videos that leverages recent advances in deep neural networks. First, several deep learning models are used to extract useful information from multiple modalities: pre-trained Convolutional Neural Networks (CNNs) for visual and audio feature extraction, and a word embedding model for textual analysis. Then, a novel fusion technique is proposed that integrates the different data representations at two levels, namely the frame level and the video level. Unlike existing multimodal learning algorithms, the proposed framework can reason about a missing data type using the other available data modalities. The framework is applied to a new video dataset containing natural disaster classes. The experimental results illustrate the effectiveness of the proposed framework compared to several single-modality deep learning models as well as conventional fusion techniques. Specifically, the final accuracy is improved by more than 16% and 7% over the best single-modality and fusion results, respectively.
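To make the two-level fusion idea in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of how frame-level and video-level fusion of visual, audio, and textual representations could be wired up. The module name, layer sizes, the number of classes, the temporal average pooling, and the classifier head are all illustrative assumptions; the paper's handling of missing modalities is not reproduced here.

```python
# Minimal sketch of two-level multimodal fusion (assumed design, not the paper's code):
# frame-level fusion combines per-frame visual and audio features, and
# video-level fusion pools the frames and concatenates a textual embedding.
import torch
import torch.nn as nn


class TwoLevelFusion(nn.Module):
    def __init__(self, vis_dim=2048, aud_dim=128, txt_dim=300,
                 hidden=512, num_classes=7):  # all dimensions are assumptions
        super().__init__()
        # Frame-level fusion: project concatenated visual + audio frame features.
        self.frame_fusion = nn.Sequential(
            nn.Linear(vis_dim + aud_dim, hidden), nn.ReLU())
        # Video-level fusion: pooled frame representation + text embedding.
        self.video_fusion = nn.Sequential(
            nn.Linear(hidden + txt_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, vis_frames, aud_frames, txt_embedding):
        # vis_frames: (B, T, vis_dim) features from a pre-trained visual CNN
        # aud_frames: (B, T, aud_dim) features from an audio CNN
        # txt_embedding: (B, txt_dim) word-embedding summary of video metadata
        fused_frames = self.frame_fusion(
            torch.cat([vis_frames, aud_frames], dim=-1))    # (B, T, hidden)
        video_repr = fused_frames.mean(dim=1)                # temporal average pooling
        fused_video = self.video_fusion(
            torch.cat([video_repr, txt_embedding], dim=-1))  # (B, hidden)
        return self.classifier(fused_video)


if __name__ == "__main__":
    model = TwoLevelFusion()
    logits = model(torch.randn(2, 16, 2048),   # 2 videos, 16 frames of visual features
                   torch.randn(2, 16, 128),    # matching audio features
                   torch.randn(2, 300))        # one text embedding per video
    print(logits.shape)  # torch.Size([2, 7])
```

In this reading, the frame-level stage aligns and mixes the time-varying modalities, while the video-level stage adds the modality that describes the whole clip; any richer pooling or missing-modality inference from the paper would replace the simple mean and concatenation used here.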
ISSN: 1386-145X, 1573-1413
DOI: 10.1007/s11280-018-0548-3