TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2023-01, Vol. PP, pp. 1-1
Main Authors: Liu, Chengxu; Yang, Huan; Fu, Jianlong; Qian, Xueming
Format: Article
Language: English
Description
Abstract: Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames. State-of-the-art approaches usually adopt a two-step solution: 1) generating locally warped pixels by computing optical flow under pre-defined motion patterns (e.g., uniform motion, symmetric motion), and 2) blending the warped pixels into a full frame through a deep neural synthesis network. However, for various complicated motions (e.g., non-uniform motion, turning around), such improper assumptions about pre-defined motion patterns introduce inconsistent warping from the two consecutive frames. As a result, the warped features for the new frame are usually not aligned, yielding distortion and blur, especially when large and complex motions occur. To solve this issue, in this paper we propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI). In particular, we formulate the warped features with inconsistent motions as query tokens, and formulate relevant regions along a motion trajectory in the two original consecutive frames as keys and values. Self-attention is learned over relevant tokens along the trajectory to blend the pristine features into the intermediate frame through end-to-end training. Experimental results demonstrate that our method outperforms other state-of-the-art methods on four widely used VFI benchmarks. Both code and pre-trained models will be released at https://github.com/ChengxuLiu/TTVFI.git.
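
The following is a minimal PyTorch sketch of the trajectory-aware attention idea described in the abstract: query tokens come from the inconsistently warped intermediate features, while keys and values are gathered from the original frame features at points sampled along a motion trajectory. The module name, tensor shapes, parameter names (d_model, n_traj), and the use of grid_sample to gather trajectory tokens are illustrative assumptions, not the authors' released TTVFI implementation (see the linked repository for that).

```python
# Illustrative sketch only; shapes and sampling strategy are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryAttention(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.to_q = nn.Linear(d_model, d_model)
        self.to_k = nn.Linear(d_model, d_model)
        self.to_v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, warped_feat, frame_feats, traj_coords):
        """
        warped_feat : (B, C, H, W)  inconsistently warped intermediate features (queries)
        frame_feats : (B, C, H, W)  features of an original input frame (keys/values source)
        traj_coords : (B, n_traj, H, W, 2)  normalized (x, y) sampling points along the
                      motion trajectory for every output pixel
        """
        B, C, H, W = warped_feat.shape
        n_traj = traj_coords.shape[1]

        # Query: one token per output pixel, taken from the warped features.
        q = self.to_q(warped_feat.permute(0, 2, 3, 1))             # (B, H, W, C)

        # Keys/values: sample the frame features at each trajectory point.
        sampled = torch.stack(
            [F.grid_sample(frame_feats, traj_coords[:, i],
                           mode='bilinear', align_corners=True)
             for i in range(n_traj)], dim=1)                       # (B, n_traj, C, H, W)
        tokens = sampled.permute(0, 3, 4, 1, 2)                    # (B, H, W, n_traj, C)
        k, v = self.to_k(tokens), self.to_v(tokens)

        # Each query attends only to its own trajectory tokens.
        attn = torch.einsum('bhwc,bhwnc->bhwn', q, k) * self.scale
        attn = attn.softmax(dim=-1)
        out = torch.einsum('bhwn,bhwnc->bhwc', attn, v)            # (B, H, W, C)
        return out.permute(0, 3, 1, 2)                             # (B, C, H, W)
```

Restricting attention to the few tokens on each pixel's trajectory, rather than over the whole frame, keeps the per-pixel cost linear in n_traj, which is one plausible reason a trajectory-aware formulation scales to large motions.
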
ISSN: 1057-7149, 1941-0042
DOI: 10.1109/TIP.2023.3302990