A part-based spatial and temporal aggregation method for dynamic scene recognition


Bibliographic Details
Published in: Neural Computing & Applications, 2021-07, Vol. 33 (13), pp. 7353-7370
Authors: Peng, Xiaoming; Bouzerdoum, Abdesselam; Phung, Son Lam
Format: Article
Language: English
Online access: Full text
Abstract: Existing methods for dynamic scene recognition mostly use global features extracted from an entire video frame or video segment. In this paper, a part-based method is proposed to aggregate local features from video frames. A pre-trained Fast R-CNN model is used to extract local convolutional features from the regions of interest of training images. These features are clustered to locate representative parts. A set cover problem is then formulated to select the discriminative parts, which are further refined by fine-tuning the Fast R-CNN model. Local features from a video segment are extracted at different layers of the fine-tuned Fast R-CNN model and aggregated both spatially and temporally. Extensive experiments show that the proposed method is highly competitive with state-of-the-art approaches.
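The abstract states that part selection is posed as a set cover problem but does not give the formulation. A common way to solve such a selection step is the greedy set-cover heuristic: treat each candidate part (a feature cluster) as covering the training images on which it responds strongly, and repeatedly pick the part that covers the most still-uncovered images. The sketch below is a minimal illustration under that assumption; the function and variable names, and the idea that coverage comes from thresholded part responses, are illustrative and not taken from the paper.

```python
# Minimal greedy set-cover sketch for discriminative part selection.
# Assumption (not from the paper): each candidate part "covers" the set of
# training images on which its detector fires above some threshold.

def greedy_set_cover(universe, part_coverage):
    """Select parts greedily until every image in `universe` is covered.

    universe: set of training-image ids to cover.
    part_coverage: dict mapping part id -> set of image ids it covers.
    Returns the list of selected part ids.
    """
    uncovered = set(universe)
    selected = []
    while uncovered:
        # Pick the part covering the most still-uncovered images.
        best = max(part_coverage, key=lambda p: len(part_coverage[p] & uncovered))
        gain = part_coverage[best] & uncovered
        if not gain:
            break  # remaining images cannot be covered by any part
        selected.append(best)
        uncovered -= gain
    return selected

# Toy usage: three candidate parts covering five training images.
images = {0, 1, 2, 3, 4}
coverage = {"part_a": {0, 1, 2}, "part_b": {2, 3}, "part_c": {3, 4}}
print(greedy_set_cover(images, coverage))  # -> ['part_a', 'part_c']
```

The greedy heuristic gives a logarithmic approximation to the optimal cover, which is why it is a standard choice for this kind of subset-selection step; whether the authors solve their set cover instance this way is not stated in the abstract.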
ISSN: 0941-0643, 1433-3058
DOI: 10.1007/s00521-020-05415-3