Video frame prediction with dual-stream deep network emphasizing motions and content details

Bibliographic details
Published in: Applied soft computing 2022-08, Vol. 125, p. 109170, Article 109170
Main authors: Huang, Qingming; Li, Zhongxiao; Zheng, Liying; Yang, Tianyi
Format: Article
Language: English
Description
Summary: Video frame prediction is both challenging and critical for computer vision. Though the research on predicting video frames has gradually shifted from pixel-law-based methods to motion-based ones, existing predictors often generate ambiguous future frames, especially for long-term predictions. This paper proposes a composed model to generate future frames with more details. First, to further exploit motion information, we design a single motion decoder to strengthen the efficiency of the motion encoder in the original motion-content network (MCnet). Second, to alleviate prediction ambiguity, we use edges both with and without semantic meaning from the holistically-nested edge detection (HED) module as content details. Third, based on the conclusion that the mean squared error (MSE) loss and the traditional generative adversarial learning framework cause the unsatisfactory predictions of MCnet, we design a composite loss function that guides our model to focus simultaneously on motions and content details. Also based on this conclusion, we finally embed our model in an improved generative adversarial network, which further enhances its performance. Experimental results on the benchmark KTH and UCF101 datasets show that our model outperforms state-of-the-art predictors, such as the basic MCnet, the predictive neural network (PredNet), and the PredNet with a reduced-gate convolutional network (rgc-PredNet), in terms of peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), especially for long-term video frame prediction.
Highlights:
• A motion decoder is added to MCnet to emphasize the motion information.
• Multi-level content details are introduced to alleviate the ambiguity of a prediction.
• The WGAN-GP framework is used to construct our model.
• A composite loss is designed to guide our model to focus on motions and content details.
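The summary reports results in terms of PSNR, which is derived directly from the MSE between a predicted frame and the ground-truth frame. The following is a minimal sketch of that standard relationship, not the authors' evaluation code. It assumes 8-bit frames flattened into plain lists of pixel values; the function names `mse` and `psnr` and the sample pixel values are hypothetical and chosen only for illustration.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized, flattened frames."""
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and the same size")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value**2 / MSE).

    max_value=255.0 assumes 8-bit pixel intensities (an assumption here,
    not something stated in the record). Identical frames give infinity.
    """
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Hypothetical 4-pixel "frames", purely for illustration.
target = [0, 64, 128, 255]
pred = [0, 60, 130, 250]
print(f"MSE:  {mse(pred, target):.2f}")
print(f"PSNR: {psnr(pred, target):.2f} dB")
```

Because PSNR is a monotone function of MSE, a model trained only on MSE is in effect optimized for PSNR, yet it can still produce blurry frames. This is consistent with the summary's motivation for pairing a composite loss with SSIM as a second metric.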
ISSN:1568-4946
1872-9681
DOI:10.1016/j.asoc.2022.109170