Compositional action recognition with multi-view feature fusion

Bibliographic Details
Published in: PLoS ONE, 2022-04, Vol. 17 (4), p. e0266259
Main authors: Zhao, Zhicheng; Liu, Yingan; Ma, Lei
Format: Article
Language: English
Online access: Full text
Abstract: Most action recognition tasks treat an activity as a single event in a video clip. Recently, representing activities as a combination of verbs and nouns has been shown to be effective in improving action understanding by capturing such compositional structure. However, there is still a lack of research on representation learning that uses cross-view or cross-modality information. To exploit the complementary information between multiple views, we propose a feature fusion framework that proceeds in two steps: extraction of appearance features and fusion of multi-view features. We validate our approach on two action recognition datasets, IKEA ASM and LEMMA, and demonstrate that multi-view fusion generalizes effectively across appearances and identifies previously unseen actions of interacting objects, surpassing current state-of-the-art methods. In particular, on the IKEA ASM dataset, the multi-view fusion approach improves top-1 accuracy by 18.1% over the single-view approach.
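
As a rough illustration of the two-step pipeline described in the abstract (per-view appearance feature extraction followed by multi-view fusion), the following PyTorch sketch uses a toy convolutional backbone and simple concatenation fusion. The backbone, the fusion operator, the class count, and all names here are illustrative assumptions; the record does not specify the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """Sketch of a two-step pipeline: per-view appearance features, then fusion."""

    def __init__(self, num_views: int, feat_dim: int = 512, num_classes: int = 10):
        super().__init__()
        # Step 1: appearance feature extractor shared across views
        # (a tiny conv stack stands in for a real backbone such as a ResNet).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Step 2: concatenate the per-view features and classify.
        self.classifier = nn.Linear(num_views * feat_dim, num_classes)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, 3, H, W)
        b, v = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1))  # (b * v, feat_dim)
        fused = feats.reshape(b, -1)                # concatenate across views
        return self.classifier(fused)

model = MultiViewFusion(num_views=3)
logits = model(torch.randn(2, 3, 3, 224, 224))  # 2 clips, 3 camera views each
print(logits.shape)  # torch.Size([2, 10])
```

Concatenation is only one plausible fusion choice; averaging or attention-weighted pooling over views would drop into the same slot without changing the surrounding pipeline.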
ISSN: 1932-6203
DOI: 10.1371/journal.pone.0266259