Using efficient group pseudo-3D network to learn spatio-temporal features

Detailed description

Bibliographic details
Published in:	Signal, image and video processing, 2021-03, Vol.15 (2), p.361-369
Main authors:	Chen, Yaosen; Guo, Bing; Shen, Yan; Wang, Wei; Suo, Xinhua; Zhang, Zhen
Format:	Article
Language:	English
Subjects:	Artificial neural networks; Computer Imaging; Computer Science; Computer vision; Convolution; Feature extraction; Image Processing and Computer Vision; Mathematical models; Model accuracy; Multimedia Information Systems; Original Paper; Parameters; Pattern Recognition and Graphics; Signal, Image and Speech Processing; Three dimensional models; Vision
Online access:	Full text

Description
Summary:	Action classification has been a challenging problem in computer vision in recent years, and three-dimensional convolutional neural networks play an important role in spatio-temporal feature extraction. However, 3D convolution requires expensive computation and memory resources. This paper proposes an efficient group pseudo-3D (GP3D) convolution that reduces model size and requires less computational power. We built GP3D on MobileNetV3 so that 2D pre-trained parameters extend directly to the 3D convolutional network. We also used GP3D to replace the convolutions in the original inflated 3D convolutional network to reduce its model size. Compared with other state-of-the-art 3D convolutional networks, GP3D with the efficient MobileNetV3 network uses about 3 to 22 times fewer parameters while maintaining the same accuracy on the UCF-101 dataset. GP3D with an inflated 3D convolutional network achieves about 90% top-1 accuracy, while its model size is only about half that of the original inflated 3D convolutional network.
ISSN:	1863-1703 1863-1711
DOI:	10.1007/s11760-020-01758-5
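
The summary attributes the savings to a grouped, pseudo-3D form of convolution but does not give the layer design. As a rough illustration only, the sketch below counts weights for a full 3D convolution, for a generic pseudo-3D factorization (a 1x3x3 spatial convolution followed by a 3x1x1 temporal one), and for the same factorization with channel groups. The channel counts, kernel sizes, group count, and the function names `conv3d_params` and `pseudo3d_params` are assumptions chosen for the example and are not taken from the paper.

```python
def conv3d_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 3D convolution without bias.

    kernel is (k_t, k_h, k_w); with groups > 1 each output channel
    only sees c_in / groups input channels.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    k_t, k_h, k_w = kernel
    return (c_in // groups) * c_out * k_t * k_h * k_w


def pseudo3d_params(c_in, c_out, groups=1):
    """Weight count of an assumed pseudo-3D factorization.

    Spatial 1x3x3 convolution followed by temporal 3x1x1 convolution;
    the same group count is applied to both (an assumption).
    """
    spatial = conv3d_params(c_in, c_out, (1, 3, 3), groups)
    temporal = conv3d_params(c_out, c_out, (3, 1, 1), groups)
    return spatial + temporal


# Hypothetical layer: 64 input channels, 64 output channels.
full = conv3d_params(64, 64, (3, 3, 3))
factorized = pseudo3d_params(64, 64)
grouped = pseudo3d_params(64, 64, groups=4)

print("full 3x3x3:        ", full)
print("pseudo-3D:         ", factorized)
print("grouped pseudo-3D: ", grouped)
print("full / grouped:    ", full / grouped)
```

For this hypothetical layer the grouped factorization needs 9 times fewer weights than the full 3x3x3 kernel. Whole-network ratios such as those quoted in the summary depend on the full architecture and cannot be derived from a single-layer count like this one.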