Sparse Deep LSTMs with Convolutional Attention for Human Action Recognition
Saved in:
Published in: SN Computer Science, 2021-05, Vol. 2 (3), p. 151, Article 151
Main authors: , ,
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Deep learning has recently achieved remarkable results in action recognition. In this paper, an architecture is proposed for action recognition comprising a ResNet feature extractor, a Conv-Attention-LSTM, a BiLSTM, and fully connected layers. Furthermore, a sparse layer is added after each LSTM layer to overcome overfitting. In addition to RGB images, optical flow is used to incorporate motion information into the architecture. Because consecutive frames are similar, video sequences are divided into equal parts, and frames from successive parts are used as consecutive frames to obtain the flow. Furthermore, a convolutional attention network is applied to find the significant regions. The proposed method is evaluated on two popular datasets, UCF-101 and HMDB-51, reaching accuracies of 95.24% and 71.62%, respectively. The results show that using a sparse layer instead of dropout reduces overfitting, and that a deep LSTM network yields a higher recognition rate than a one-layer LSTM.
ISSN: 2662-995X, 2661-8907
DOI: 10.1007/s42979-021-00576-x
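The frame-pairing scheme mentioned in the abstract (dividing a video into equal parts and treating frames from successive parts as "consecutive" frames for optical flow) can be illustrated with a minimal Python sketch. This is an assumption-laden reconstruction, not the paper's implementation: it assumes contiguous equal parts and takes the first frame of each part; all function names are illustrative.

```python
def split_into_parts(num_frames, num_parts):
    """Split frame indices 0..num_frames-1 into num_parts contiguous,
    (nearly) equal parts, as described in the abstract."""
    base, rem = divmod(num_frames, num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < rem else 0)  # spread any remainder over the first parts
        parts.append(list(range(start, start + size)))
        start += size
    return parts

def flow_frame_pairs(num_frames, num_parts):
    """Pair one frame from each part with one frame from the next part;
    each pair would feed an optical-flow computation (choice of the first
    frame per part is an assumption)."""
    parts = split_into_parts(num_frames, num_parts)
    return [(parts[i][0], parts[i + 1][0]) for i in range(len(parts) - 1)]
```

For example, `flow_frame_pairs(100, 5)` yields `[(0, 20), (20, 40), (40, 60), (60, 80)]`, so flow is computed between frames 20 apart rather than between truly adjacent, highly similar frames.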