Improving embedding learning by virtual attribute decoupling for text-based person search

Bibliographic Details
Published in: Neural Computing & Applications 2022-04, Vol. 34 (7), p. 5625-5647
Main Authors: Wang, Chengji; Luo, Zhiming; Lin, Yaojin; Li, Shaozi
Format: Article
Language: English
Online Access: Full text
Description
Abstract: This paper considers the problem of text-based person search, which aims to find a target person based on a query textual description. Previous methods commonly focus on learning shared image-text embeddings but largely ignore the effect of pedestrian attributes. Attributes are fine-grained information that provide mid-level semantics and have been demonstrated to be effective in traditional image-based person search. However, in text-based person search it is hard to incorporate attribute information when learning discriminative image-text embeddings, because (1) the way attributes are described varies across different texts and (2) it is hard to decouple attribute-related information without the help of attribute annotations. In this paper, we propose an improving embedding learning by virtual attribute decoupling (iVAD) model for learning modality-invariant image-text embeddings. To the best of our knowledge, this is the first work to perform unsupervised attribute decoupling in the text-based person search task. In iVAD, we first propose a novel virtual attribute decoupling (VAD) module, which uses an encoder-decoder embedding learning structure to decompose attribute information from images and texts. In this module, we regard the pedestrian attributes as a hidden vector and obtain attribute-related embeddings. In addition, unlike previous works that separate attribute learning from image-text embedding learning, we propose a hierarchical feature embedding framework: an attribute-enhanced feature embedding (AEFE) module incorporates the attribute-related embeddings into the learned image-text embeddings. The proposed AEFE module utilizes attribute information to improve the discriminability of the learned features. Extensive evaluations demonstrate the superiority of our method over a wide variety of state-of-the-art methods on the CUHK-PEDES dataset. Experimental results on Caltech-UCSD Birds (CUB), Oxford-102 Flowers (Flowers) and Flickr30K further verify the effectiveness of the proposed approach. A further visualization shows that the proposed iVAD model can effectively discover the co-occurring pedestrian attributes in corresponding image-text pairs.
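The abstract only sketches the two modules at a high level. The following minimal PyTorch sketch illustrates one plausible reading of the VAD encoder-decoder (pedestrian attributes as a hidden vector, decoupled without annotations via reconstruction) and the AEFE fusion step. All class names, dimensions, the reconstruction loss, and the additive fusion are assumptions for illustration only, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VirtualAttributeDecoupler(nn.Module):
    """Hypothetical sketch of the VAD idea: an encoder-decoder that treats
    pedestrian attributes as a latent hidden vector, so attribute-related
    information can be decoupled without attribute annotations."""
    def __init__(self, feat_dim=2048, attr_dim=512):
        super().__init__()
        # Encoder maps a modality embedding to a latent attribute vector.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, attr_dim),
                                     nn.ReLU(inplace=True))
        # Decoder reconstructs the input; reconstruction pressure encourages
        # the latent vector to retain mid-level (attribute) semantics.
        self.decoder = nn.Linear(attr_dim, feat_dim)

    def forward(self, feat):
        attr_embed = self.encoder(feat)   # attribute-related embedding
        recon = self.decoder(attr_embed)  # reconstruction for a loss term
        return attr_embed, recon

class AttributeEnhancedFusion(nn.Module):
    """Hypothetical sketch of the AEFE idea: fold the attribute-related
    embedding back into the global image-text embedding."""
    def __init__(self, feat_dim=2048, attr_dim=512):
        super().__init__()
        self.project = nn.Linear(attr_dim, feat_dim)

    def forward(self, feat, attr_embed):
        # Simple additive enhancement; the paper's actual fusion may differ.
        return feat + self.project(attr_embed)

# Usage sketch: the same modules apply to either modality's pooled features.
img_feat = torch.randn(8, 2048)   # e.g., pooled CNN image features
vad = VirtualAttributeDecoupler()
aefe = AttributeEnhancedFusion()
img_attr, img_recon = vad(img_feat)
enhanced_img = aefe(img_feat, img_attr)
recon_loss = nn.functional.mse_loss(img_recon, img_feat)
```

The bottleneck is what would let the latent vector act as a "virtual attribute": since both image and text embeddings must be reconstructable from it, it is pushed toward the mid-level semantics the two modalities share.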
ISSN: 0941-0643
1433-3058
DOI: 10.1007/s00521-021-06734-9