Action recognition in still images using a multi-attention guided network with weakly supervised saliency detection

Action recognition in still images is an interesting subject in computer vision. One of the most important problems in still image-based action recognition is the lack of temporal information; At the same time, other existing problems such as cluttered backgrounds and diverse objects make the recogn...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Multimedia tools and applications 2021-09, Vol.80 (21-23), p.32567-32593
Hauptverfasser: Ashrafi, Seyed Sajad, Shokouhi, Shahriar B., Ayatollahi, Ahmad
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Action recognition in still images is an interesting subject in computer vision. One of the most important problems in still image-based action recognition is the lack of temporal information; At the same time, other existing problems such as cluttered backgrounds and diverse objects make the recognition task more challenging. However, there may be several salient regions in each action image, employing of which could lead to an improvement in the recognition performance. Moreover, since no unique and clear definition exists for detecting these salient regions in action recognition images, therefore, obtaining reliable ground truth salient regions is a highly challenging task. This paper presents a multi-attention guided network with weakly-supervised multiple salient regions detection for action recognition. A teacher-student structure is used to guide the attention of the student model into the salient regions. The teacher network with Salient Region Proposal (SRP) module generates weakly-supervised data for the student network in the training phase. The student network, with Multi-ATtention (MAT) module, proposes multiple salient regions and predicts the actions based on the found information in the evaluation phase. The proposed method obtains mean Average Precision (mAP) value of 94.2% and 93.80% on Stanford-40 Actions and PASCAL VOC2012 datasets, respectively. The experimental results, based on the ResNet-50 architecture, show the superiority of the proposed method compared to the existing ones on Stanford-40 and VOC2012 datasets. Also, we have made a major modification to the BU101 dataset which is now publicly available. The proposed method achieves mAP value of 90.16% on the new BU101 dataset.
ISSN:1380-7501
1573-7721
DOI:10.1007/s11042-021-11215-1