Multiple instance deep learning for weakly-supervised visual object tracking

Bibliographic details
Published in:	Signal processing. Image communication 2020-05, Vol.84, p.115807, Article 115807
Main authors:	Huang, Kaining; Shi, Yan; Zhao, Fuqi; Zhang, Zijun; Tu, Shanshan
Format:	Article
Language:	English
Subjects:	Algorithms; Deep learning; Gaussian mixture model; Gesture recognition; Human motion; Lighting; Motion capture; Multi-view feature learning; Multiple instance learning (MIL); Normal distribution; Object recognition; Object tracking; Optical tracking; Probabilistic models; Semantics; Statistical analysis; Tags; Virtual reality; Weakly-supervised
Online access:	Full text

Description
Abstract:	Intelligently tracking objects with varied shapes, colors, lighting conditions, and backgrounds is valuable in many human-computer interaction (HCI) applications, such as human body motion capture, hand gesture recognition, and virtual reality (VR) games. However, accurately tracking different objects in uncontrolled environments is challenging because of possibly dynamic object parts, varied lighting conditions, and complex backgrounds. In this work, we propose a novel semantically-aware object tracking framework, wherein the key is a weakly-supervised learning paradigm that optimally transfers video-level semantic tags into various regions. More specifically, given a set of training video clips, each of which is associated with multiple video-level semantic tags, we first propose a weakly-supervised learning algorithm to transfer the semantic tags into various video regions. The key is a manifold embedding algorithm based on multiple instance learning (MIL) (Zhong et al., 2020) [1] that maps all video regions into a semantic space, wherein the video-level semantic tags are well encoded. Afterward, for each video region, we use the semantic feature combined with the appearance feature as its representation. We design a multi-view learning algorithm to optimally fuse these two types of features. Based on the fused feature, we learn a probabilistic Gaussian mixture model to predict the target probability of each candidate window, and the window with the maximal probability is output as the tracking result. Comprehensive comparative results on a challenging pedestrian tracking task and on human hand gesture recognition demonstrate the effectiveness of our method. Moreover, visualized tracking results show that non-rigid objects with moderate occlusions can be well localized by our method.
•The proposed method can cope with a varying number of lanes as well as lane changes.
•In addition, our method is robust to different weather conditions and can run in real time.
ISSN:	0923-5965 1879-2677
DOI:	10.1016/j.image.2020.115807
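
The final step described in the abstract (scoring each candidate window with a Gaussian mixture model over the fused feature and outputting the highest-scoring window) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration, not the authors' implementation: the function names `fuse_features`, `gmm_log_density`, and `select_window` are hypothetical, the fusion is a plain concatenation standing in for the paper's multi-view learning algorithm, the mixture uses diagonal covariances, and the log-density is used as a stand-in for the "target probability".

```python
import numpy as np


def fuse_features(semantic, appearance):
    """Placeholder fusion: concatenate the two views per candidate window.

    Assumption: the paper learns the fusion with a multi-view algorithm;
    concatenation is used here only to keep the sketch self-contained.
    """
    return np.concatenate([semantic, appearance], axis=1)


def gmm_log_density(x, weights, means, variances):
    """Log density of each row of x under a diagonal-covariance mixture.

    x: (n, d) features, weights: (k,), means: (k, d), variances: (k, d).
    """
    x = np.atleast_2d(x)
    diff = x[:, None, :] - means[None, :, :]              # (n, k, d)
    log_comp = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)             # Mahalanobis term
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)  # normalizer
    )                                                     # (n, k)
    log_weighted = log_comp + np.log(weights)
    # Log-sum-exp over components for numerical stability.
    peak = log_weighted.max(axis=1, keepdims=True)
    return (peak + np.log(np.exp(log_weighted - peak).sum(axis=1, keepdims=True))).ravel()


def select_window(semantic, appearance, weights, means, variances):
    """Return the index of the best-scoring candidate window and all scores."""
    fused = fuse_features(semantic, appearance)
    scores = gmm_log_density(fused, weights, means, variances)
    return int(np.argmax(scores)), scores
```

In this sketch the mixture parameters are passed in directly; in practice they would be estimated from training features (for example with expectation-maximization), which is outside the scope of the example.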