Self-supervised representation learning by predicting visual permutations

Bibliographic details
Published in: Knowledge-based systems, 2020-12, Vol. 210, p. 106534, Article 106534
Main authors: Zhao, Qilu; Dong, Junyu
Format: Article
Language: English
Description
Summary: We propose a self-supervised learning method that uncovers the spatial or temporal structure of visual data by identifying the position of a patch within an image or the position of a frame within a video, a task related to the Jigsaw puzzle reassembly problem studied in previous work. A Jigsaw puzzle can be seen as a shuffled sequence, generated by shuffling image patches or video frames according to an unknown permutation. Predicting these visual permutations can train a learning system to capture structural information that is important for semantic-level tasks such as object recognition and action recognition. To this end, we propose a multi-task learning framework in which a group of principal tasks predicts the index of each sample in the original sequence, and a group of auxiliary tasks predicts the spatial or temporal relation of adjacent samples in the shuffled sequence. Our scheme can handle the whole space of permutations and is fairly scalable, and it is generic enough to address problems such as self-supervised representation learning, relative attributes, and learning to rank. Our method achieves state-of-the-art performance on the STL-10 benchmark for unsupervised representation learning, and it is competitive with the state of the art on UCF-101 and HMDB-51 as a pretraining method for action recognition. In addition, we apply the proposed method to an age comparison task to show that it generalizes to ranking problems.
• The proposed method is capable of handling the whole space of permutations.
• The designed architecture is flexible and extensible.
• Our method achieved state-of-the-art performance on STL-10.
• Our method is generic enough to solve ranking problems.
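The two groups of prediction targets described in the summary can be illustrated with a small sketch. This is not the authors' implementation: the function names (`sample_order`, `apply_order`, `labels_from_order`) are hypothetical, and the auxiliary relation is assumed here to be a simple binary "did the left neighbour originally come before the right neighbour" label, which is only one possible reading of "spatial or temporal relation".

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_order(num_items: int, rng: Optional[random.Random] = None) -> List[int]:
    """Draw a random permutation; order[k] is the original index of the
    item placed at shuffled position k."""
    rng = rng or random.Random()
    order = list(range(num_items))
    rng.shuffle(order)
    return order


def apply_order(items: Sequence, order: Sequence[int]) -> list:
    """Build the shuffled sequence (patches or frames) from the originals."""
    return [items[i] for i in order]


def labels_from_order(order: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Derive both target groups from a permutation.

    Principal targets: the original index of each shuffled item.
    Auxiliary targets (assumed binary form): 1 if the left item of an
    adjacent pair originally preceded the right item, else 0.
    """
    index_targets = list(order)
    relation_targets = [
        1 if left < right else 0 for left, right in zip(order, order[1:])
    ]
    return index_targets, relation_targets


if __name__ == "__main__":
    frames = ["f0", "f1", "f2", "f3"]   # stand-ins for video frames
    order = [2, 0, 3, 1]                # a fixed permutation for illustration
    print(apply_order(frames, order))   # ['f2', 'f0', 'f3', 'f1']
    print(labels_from_order(order))     # ([2, 0, 3, 1], [0, 1, 0])
```

Because each shuffled item gets its own index target, the number of outputs grows with the sequence length rather than with the number of permutations, which is consistent with the summary's claim that the whole permutation space can be handled.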
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2020.106534