Transductive Video Segmentation on Tree-Structured Model

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2017-05, Vol. 27 (5), pp. 992-1005
Authors: Botao Wang, Zhihui Fu, Hongkai Xiong, Yuan F. Zheng
Format: Article
Language: English
Abstract: This paper presents a transductive multicomponent video segmentation algorithm, which is capable of segmenting the predefined object of interest in the frames of a video sequence. To ensure temporal consistency, a temporally coherent parametric min-cut algorithm is developed to generate segmentation hypotheses based on visual and motion cues. Each hypothesis is then evaluated by an energy function that combines foreground resemblance, foreground/background divergence, boundary strength, and visual saliency. In particular, the state-of-the-art region-based convolutional neural network (R-CNN) descriptor is leveraged to encode the visual appearance of the foreground object. Finally, the optimal segmentation of each frame is attained by assembling the segmentation hypotheses through a Monte Carlo approximation. To capture variations of the foreground object in shape and pose, multiple foreground components are built. To group the frames into different components, a tree-structured graphical model named the temporal tree is designed, in which visually similar and temporally coherent frames are arranged in branches. The temporal tree is constructed by iteratively adding frames to the active nodes by probabilistic clustering. In addition, each component, consisting of the frames in the same branch, is characterized by a support vector machine classifier, which is learned in a transductive fashion by jointly maximizing the margin over the labeled and unlabeled frames. As frames from the same video sequence follow the same distribution, the transductive classifiers achieve stronger generalization than inductive ones. Experimental results on public benchmarks demonstrate the effectiveness of the proposed method in comparison with other state-of-the-art supervised and unsupervised video segmentation methods.
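
To make the hypothesis-assembly step concrete, the following is a minimal Python sketch, assuming each hypothesis is a binary mask scored by a single scalar energy and that the Monte Carlo assembly reduces to an exponentially weighted per-pixel vote. The function name, the temperature parameter beta, and the exp(-E) weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def assemble_hypotheses(masks, energies, beta=1.0):
    """Fuse segmentation hypotheses into a single binary mask.

    masks    : list of HxW boolean arrays, one per hypothesis
    energies : scalar energy per hypothesis; lower = better fit
    beta     : assumed temperature of the exp(-beta * E) weighting
    """
    e = np.asarray(energies, dtype=float)
    w = np.exp(-beta * (e - e.min()))   # shift by min for numerical stability
    w /= w.sum()
    # Per-pixel foreground probability as a weighted vote over hypotheses.
    prob = sum(wi * m.astype(float) for wi, m in zip(w, masks))
    return prob >= 0.5                  # threshold into the final mask
```

The exponential weighting concentrates mass on low-energy hypotheses, so the vote approximates picking the best hypothesis while still letting strong runners-up refine the boundary.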
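The temporal-tree construction can be sketched in the same spirit: frames are visited in temporal order and either appended to the best-matching active node or branched off into a new component. The affinity function, the threshold tau, and the branching rule below are hypothetical stand-ins for the probabilistic clustering that the abstract mentions without detail.

```python
import numpy as np

def build_temporal_tree(frames, affinity, tau=0.5):
    """Grow a temporal tree by iteratively adding frames to active nodes.

    affinity(frame, node) -> assumed probability that `frame` belongs to
    the component stored at `node`; tau is a hypothetical branching
    threshold. Each node's frame list is one foreground component.
    """
    root = {"frames": [frames[0]], "children": []}
    active = [root]                      # nodes still accepting new frames
    for f in frames[1:]:
        p = [affinity(f, n) for n in active]
        best = int(np.argmax(p))
        if p[best] >= tau:
            active[best]["frames"].append(f)         # extend this branch
        else:
            child = {"frames": [f], "children": []}  # branch: new component
            active[best]["children"].append(child)
            active.append(child)
    return root
```

Under this reading, visually similar and temporally coherent frames accumulate along one branch, while a change in shape or pose triggers a new branch, i.e., a new component.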
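The joint margin maximization over labeled and unlabeled frames described for the per-component classifiers matches the standard transductive SVM objective; the paper's precise formulation may differ, so treat the following as a reference form:

```latex
\min_{\mathbf{w},\, b,\, \hat{y}_{\ell+1},\dots,\hat{y}_{\ell+u}}
\;\frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
+ C \sum_{i=1}^{\ell} \max\bigl(0,\; 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b)\bigr)
+ C^{*} \sum_{j=\ell+1}^{\ell+u} \max\bigl(0,\; 1 - \hat{y}_j(\mathbf{w}^{\top}\mathbf{x}_j + b)\bigr)
```

Here the x_i are labeled frames with labels y_i, and the x_j are unlabeled frames of the same component whose labels ŷ_j ∈ {−1, +1} are optimized jointly with w and b. Because labeled and unlabeled frames are drawn from the same video, the learned margin separates both sets, which is the source of the stronger generalization claimed over inductive classifiers.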
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2016.2527378