SMTDKD: A Semantic-Aware Multimodal Transformer Fusion Decoupled Knowledge Distillation Method for Action Recognition

Published in: IEEE Sensors Journal, 2024-01, Vol. 24 (2), p. 1-1
Authors: Quan, Zhenzhen; Chen, Qingshan; Wang, Wei; Zhang, Moyan; Li, Xiang; Li, Yujun; Liu, Zhi
Format: Article
Language: English
Abstract: Multimodal sensors, including vision sensors and wearable sensors, offer valuable complementary information for accurate recognition tasks. Nonetheless, the heterogeneity among sensor data from different modalities presents a formidable challenge in extracting robust multimodal information amidst noise. In this paper, we propose an innovative approach, the semantic-aware multimodal transformer fusion decoupled knowledge distillation (SMTDKD) method, which guides video-data recognition through information interaction not only among different wearable-sensor modalities but also between visual-sensor and wearable-sensor data, improving the robustness of the model. To preserve the temporal relationships within the wearable-sensor data, SMTDKD converts them into 2D image data. Furthermore, a transformer-based multimodal fusion module is designed to capture diverse feature information from the distinct wearable-sensor modalities. To mitigate modality discrepancies and encourage semantically similar features, graph cross-view attention maps are constructed across various convolutional layers to facilitate feature alignment. Additionally, semantic information is exchanged among the teacher network, the student network, and BERT-encoded labels. To obtain more comprehensive knowledge transfer, the decoupled knowledge distillation loss is utilized, thereby enhancing the generalization of the network. Experimental evaluations conducted on three multimodal datasets, namely UTD-MHAD, Berkeley-MHAD, and MMAct, demonstrate the superior performance of the proposed SMTDKD method over state-of-the-art human action recognition methods.
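
The decoupled knowledge distillation (DKD) loss named in the abstract refers to the technique of Zhao et al. (CVPR 2022), which splits the standard KD KL divergence into a target-class term (TCKD) and a non-target-class term (NCKD) that are weighted independently. The following is a minimal PyTorch sketch of that loss term for illustration only; it is not the authors' SMTDKD code, and the default weights alpha, beta, and temperature T are placeholder assumptions.

    import torch
    import torch.nn.functional as F

    def dkd_loss(student_logits, teacher_logits, target,
                 alpha=1.0, beta=8.0, T=4.0):
        # One-hot mask marking the ground-truth class of each sample.
        gt_mask = torch.zeros_like(student_logits).scatter_(
            1, target.unsqueeze(1), 1.0).bool()

        # Temperature-softened class probabilities for student and teacher.
        p_s = F.softmax(student_logits / T, dim=1)
        p_t = F.softmax(teacher_logits / T, dim=1)

        # TCKD: KL divergence between binary (target vs. rest) distributions.
        b_s = torch.stack([(p_s * gt_mask).sum(1), (p_s * ~gt_mask).sum(1)], dim=1)
        b_t = torch.stack([(p_t * gt_mask).sum(1), (p_t * ~gt_mask).sum(1)], dim=1)
        tckd = F.kl_div(b_s.log(), b_t, reduction="batchmean") * (T ** 2)

        # NCKD: KL divergence over non-target classes only; the target logit
        # is suppressed with a large negative offset before the softmax.
        log_q_s = F.log_softmax(student_logits / T - 1000.0 * gt_mask, dim=1)
        q_t = F.softmax(teacher_logits / T - 1000.0 * gt_mask, dim=1)
        nckd = F.kl_div(log_q_s, q_t, reduction="batchmean") * (T ** 2)

        return alpha * tckd + beta * nckd

In a distillation loop this would be called as, e.g., dkd_loss(student(x), teacher(x).detach(), labels), with the teacher's gradients detached so only the student is updated.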
ISSN: 1530-437X
EISSN: 1558-1748
DOI: 10.1109/JSEN.2023.3337367