Diverse and tailored image generation for zero-shot multi-label classification



Bibliographic Details
Published in: Knowledge-Based Systems, 2024-09, Vol. 299, p. 112077, Article 112077
Authors: Zhang, Kaixin; Yuan, Zhixiang; Huang, Tao
Format: Article
Language: English
Online access: Full text
Description
Abstract: Recently, zero-shot multi-label classification has garnered considerable attention owing to its capacity to predict unseen labels without human annotations. Nevertheless, prevailing approaches often use seen classes as imperfect proxies for unseen classes, resulting in suboptimal performance. Drawing inspiration from the success of text-to-image generation models in producing realistic images, we propose an innovative solution: generating synthetic data to construct a training set explicitly tailored for proxyless training of unseen labels. Our approach introduces a novel image generation framework that produces multi-label synthetic images of unseen classes for classifier training. To enhance the diversity of the generated images, we leverage a pretrained large language model to generate diverse prompts. Employing a pretrained multimodal contrastive language-image pretraining (CLIP) model as a discriminator, we assess whether the generated images accurately represent the target classes. This enables automatic filtering of inaccurately generated images and preserves classifier accuracy. To refine the text prompts for more precise and effective multi-label object generation, we introduce a CLIP score-based discriminative loss to fine-tune the text encoder in the diffusion model. In addition, we propose a feature fusion module inspired by transformer attention mechanisms; it enhances the visual features for the target task while preserving the generalization of the original features and mitigating the catastrophic forgetting that fine-tuning the entire visual encoder would cause, and it captures the global dependencies between multiple objects more effectively. Extensive experimental results validate the effectiveness of our approach, demonstrating significant improvements over state-of-the-art methods. The code is available at https://github.com/TAKELAMAG/Diff-ZS-MLC.
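As an illustration of the CLIP-based filtering step described in the abstract, the following Python sketch scores each generated image against prompts for its intended classes and keeps it only if every class clears a similarity threshold. This is a hedged reconstruction, not the authors' released Diff-ZS-MLC implementation: the CLIP checkpoint, the prompt template, the keep_image helper, and the 0.25 threshold are all illustrative assumptions.

# Minimal sketch of CLIP-as-discriminator filtering for generated multi-label images.
# Illustrative only; model name, prompt template, and threshold are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def keep_image(image: Image.Image, target_classes: list[str], threshold: float = 0.25) -> bool:
    """Return True if the generated image plausibly contains every target class."""
    prompts = [f"a photo of a {c}" for c in target_classes]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
        # logits_per_image = logit_scale * cosine similarities; undo the scale
        # to recover raw image-text cosine similarities for thresholding.
        sims = outputs.logits_per_image / model.logit_scale.exp()
    # Require a minimum similarity for every intended class before keeping the image.
    return bool((sims >= threshold).all())

In the pipeline outlined by the abstract, such a filter would sit between the diffusion-based image generator and classifier training, so that inaccurately generated images are discarded before they reach the synthetic training set.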
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2024.112077