Still image action recognition based on interactions between joints and objects
Published in: Multimedia Tools and Applications, 2023-07, Vol. 82 (17), pp. 25945-25971
Main authors: , ,
Format: Article
Language: English
Online access: Full text
Abstract: Still image-based action recognition is a challenging task in which recognition must be performed from only a single input image. Exploiting auxiliary information such as pose, objects, or background is a common technique in this field; however, the simultaneous use of several auxiliary components, and their optimal combination, has received less study. In this work, two cues, body joints and objects, are employed simultaneously, and an attention module is proposed to combine the features of these two components. The attention module consists of two self-attentions and a cross-attention, designed to account for the interactions between objects, between joints, and between joints and objects, respectively. In addition, a Multi-scale Atrous Spatial Pyramid Pooling (MASPP) module is proposed to reduce the number of parameters of the method while combining features obtained from different levels of the backbone, and a Joint Object Pooling (JOPool) module is proposed to extract local features from the joint and object regions. ResNets are used as the backbone, with the stride of the last two layers modified. Experimental results on several datasets show that combining multiple auxiliary components effectively increases the mean Average Precision (mAP) of recognition. Evaluated on three widely used benchmarks, Stanford-40, PASCAL VOC 2012, and BU101PLUS, the proposed method achieves mAPs of 94.84%, 93.20%, and 91.25%, respectively, surpassing the best previously reported methods.
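The record reproduces only the abstract, so the exact design of the attention module is not given here. As a hypothetical illustration of the described structure (two self-attention branches for joint-joint and object-object interactions, plus a cross-attention branch for joint-object interactions), a minimal PyTorch sketch might look as follows; the class name, feature dimension, head count, and fusion step are all assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class JointObjectAttention(nn.Module):
    """Hypothetical sketch of the described attention module:
    two self-attentions (joints, objects) and one cross-attention
    (joints attending to objects). Dimensions are assumed."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.joint_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.object_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, joints: torch.Tensor, objects: torch.Tensor) -> torch.Tensor:
        # joints:  (B, Nj, dim) features of detected body joints
        # objects: (B, No, dim) features of detected objects
        j, _ = self.joint_self(joints, joints, joints)      # joint-joint interactions
        o, _ = self.object_self(objects, objects, objects)  # object-object interactions
        # cross-attention: attended joints as queries, attended objects as keys/values
        c, _ = self.cross(j, o, o)                          # joint-object interactions
        # pool over tokens and fuse the self- and cross-attended joint features
        fused = self.fuse(torch.cat([j.mean(1), c.mean(1)], dim=-1))
        return fused  # (B, dim) combined representation

# usage with random features (17 joints, 5 objects)
module = JointObjectAttention()
out = module(torch.randn(2, 17, 256), torch.randn(2, 5, 256))
print(out.shape)  # torch.Size([2, 256])
```

The sketch pools the self- and cross-attended joint tokens and fuses them with a linear layer; the paper may well combine the three branches differently.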
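The MASPP module is likewise described only at a high level. Standard Atrous Spatial Pyramid Pooling (known from DeepLab) runs parallel dilated convolutions at several rates over one feature map; the "multi-scale" aspect presumably adds fusion across backbone levels, which also fits the modified strides of the last two ResNet stages. A sketch under those assumptions, with all channel counts and dilation rates invented for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Standard atrous spatial pyramid pooling branch:
    parallel dilated 3x3 convolutions at several rates."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.project = nn.Conv2d(len(rates) * out_ch, out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class MASPPSketch(nn.Module):
    """Hypothetical multi-scale variant: apply ASPP to two backbone
    levels and fuse them. Channel counts (512, 2048) assume the
    outputs of ResNet stages 2 and 4."""
    def __init__(self, out_ch: int = 256):
        super().__init__()
        self.aspp_mid = ASPP(512, out_ch)
        self.aspp_top = ASPP(2048, out_ch)
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, mid, top):
        # upsample the deeper level to the mid-level resolution, then fuse
        top_up = F.interpolate(self.aspp_top(top), size=mid.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([self.aspp_mid(mid), top_up], dim=1))

# usage with random stage outputs
m = MASPPSketch()
out = m(torch.randn(1, 512, 28, 28), torch.randn(1, 2048, 14, 14))
print(out.shape)  # torch.Size([1, 256, 28, 28])
```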
ISSN: 1380-7501 (print); 1573-7721 (electronic)
DOI: 10.1007/s11042-023-14350-z