Using efficient group pseudo-3D network to learn spatio-temporal features


Bibliographic details
Published in: Signal, Image and Video Processing, 2021-03, Vol. 15 (2), pp. 361-369
Main authors: Chen, Yaosen; Guo, Bing; Shen, Yan; Wang, Wei; Suo, Xinhua; Zhang, Zhen
Format: Article
Language: English
Online access: Full text
Description
Abstract: Action classification has been a challenging problem in computer vision in recent years, and three-dimensional (3D) convolutional neural networks play an important role in spatio-temporal feature extraction. However, 3D convolution is expensive in both computation and memory. This paper proposes an efficient group pseudo-3D (GP3D) convolution that reduces model size and requires less computational power. We build GP3D on MobileNetV3 so that 2D pre-trained parameters extend directly to the 3D convolutional network. We also use GP3D to replace the original inflated 3D convolutional network, which reduces the model size. Compared with other state-of-the-art 3D convolutional networks, GP3D with MobileNetV3 uses about 3 to 22 times fewer parameters while maintaining the same accuracy on the UCF-101 dataset. GP3D with an inflated 3D convolutional network achieves about 90% top-1 accuracy with a model only about half the size of the original inflated 3D convolutional network.
ISSN: 1863-1703, 1863-1711
DOI: 10.1007/s11760-020-01758-5
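
Note on the abstract: the record does not describe the internal layout of a GP3D block, so the following is only a rough sketch of why a grouped, pseudo-3D factorization can shrink a model. It assumes (this is not stated in the record) that a k x k x k convolution is split into a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal convolution, each using channel groups. The function names, channel counts, and group count are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, groups=1):
    """Weight count of a 3D convolution with the given kernel (bias omitted)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kt * kh * kw


def pseudo3d_params(c_in, c_out, k=3, groups=1):
    """Weight count of an assumed pseudo-3D factorization:
    a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal one."""
    spatial = conv3d_params(c_in, c_out, 1, k, k, groups)
    temporal = conv3d_params(c_out, c_out, k, 1, 1, groups)
    return spatial + temporal


# Hypothetical layer: 64 input channels, 64 output channels, kernel size 3.
full_3d = conv3d_params(64, 64, 3, 3, 3)            # dense 3 x 3 x 3 convolution
pseudo_3d = pseudo3d_params(64, 64, k=3, groups=1)  # factorized, no grouping
grouped = pseudo3d_params(64, 64, k=3, groups=4)    # factorized, 4 groups (assumed)

print(full_3d, pseudo_3d, grouped)
```

For this illustrative layer the dense 3D convolution has 110592 weights, the ungrouped factorization 49152, and the grouped factorization 12288. These figures only show the direction of the saving for one made-up layer; the 3x to 22x range quoted in the abstract refers to whole networks and depends on the authors' actual architecture.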