PersonMAE: Person Re-Identification Pre-Training With Masked AutoEncoders

Pre-training is playing an increasingly important role in learning generic feature representation for Person Re-identification (ReID). We argue that a high-quality ReID representation should have three properties, namely, multi-level awareness, occlusion robustness, and cross-region invariance. To t...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	IEEE transactions on multimedia 2024, Vol.26, p.10029-10040
Hauptverfasser:	Hu, Hezhen, Dong, Xiaoyi, Bao, Jianmin, Chen, Dongdong, Yuan, Lu, Chen, Dong, Li, Houqiang
Format:	Artikel
Sprache:	eng
Schlagworte:	Decoding High-quality ReID representation Image color analysis masked autoencoder Pedestrians pre-training Robustness Semantics Task analysis Visualization
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Pre-training is playing an increasingly important role in learning generic feature representation for Person Re-identification (ReID). We argue that a high-quality ReID representation should have three properties, namely, multi-level awareness, occlusion robustness, and cross-region invariance. To this end, we propose a simple yet effective pre-training framework, namely PersonMAE, which involves two core designs into masked autoencoders to better serve the task of Person Re-ID. 1) PersonMAE generates two regions from the given image with RegionA as the input and RegionB as the prediction target. RegionA is corrupted with block-wise masking to mimic common occlusion in ReID and its remaining visible parts are fed into the encoder. 2) Then PersonMAE aims to predict the whole RegionB at both pixel level and semantic feature level. It encourages its pre-trained feature representations with the three properties mentioned above. These properties make PersonMAE compatible with downstream Person ReID tasks, leading to State-of-the-Art performance on four downstream ReID tasks, i.e., supervised (holistic and occluded setting), and unsupervised (UDA and USL setting). Notably, on the commonly adopted supervised setting, PersonMAE with ViT-B backbone achieves 79.8% and 69.5% mAP on the MSMT17 and OccDuke datasets, surpassing the previous State-of-the-Art by a large margin of +8.0 mAP, and +5.3 mAP, respectively.
ISSN:	1520-9210 1941-0077
DOI:	10.1109/TMM.2024.3405649