Learning Modality-Consistent Latent Representations for Generalized Zero-Shot Learning

Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2023, Vol. 25, pp. 2252-2265
Main Authors: Ye, Yalan; Pan, Tongjie; Luo, Tonghoujun; Li, Jingjing; Shen, Heng Tao
Format: Article
Language: English
Description
Abstract: In generative adversarial network (GAN) based zero-shot learning (ZSL) approaches, the synthesized unseen visual features are inevitably biased toward the seen classes, since the feature generator is trained only on seen references; this causes inconsistency between visual features and their corresponding semantic attributes. The visual-semantic inconsistency is primarily induced by non-preserved semantic-relevant components and non-rectified semantic-irrelevant low-level visual details. Existing generative models generally tackle the issue by aligning the distributions of the two modalities with an additional visual-to-semantic embedding, which tends to cause the hubness problem and ruin the diversity of the visual modality. In this paper, we propose a novel generative model, learning modality-consistent latent representations GAN (LCR-GAN), which addresses the problem by embedding the visual features and their semantic attributes into a shared latent space. Specifically, to preserve the semantic-relevant components, the distributions of the two modalities are aligned by maximizing the mutual information between them. To rectify the semantic-irrelevant visual details, the mutual information between the original visual features and their latent representations is confined within an appropriate range. Meanwhile, the latent representations are decoded back into both modalities to further preserve the semantic-relevant components. Extensive evaluations on four public ZSL benchmarks validate the superiority of our method over other state-of-the-art methods.
ISSN: 1520-9210; 1941-0077
DOI: 10.1109/TMM.2022.3145237
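The abstract's core objective, maximizing the mutual information between paired visual and semantic latent codes, is typically optimized through a tractable lower bound. The abstract does not specify which estimator LCR-GAN uses, so the sketch below illustrates the idea with the widely used InfoNCE bound; the function name, temperature value, and the use of a simple numpy batch are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def infonce_mi_bound(z_v, z_s, temperature=0.1):
    """InfoNCE lower bound on I(z_v; z_s) for a batch of paired latent codes.

    z_v, z_s: (N, d) arrays of latent codes from the visual and semantic
    encoders, where row i of each array comes from the same example.
    This is a generic estimator, not the one used in LCR-GAN.
    """
    # Cosine-normalize both modalities so similarity is scale-invariant
    z_v = z_v / np.linalg.norm(z_v, axis=1, keepdims=True)
    z_s = z_s / np.linalg.norm(z_s, axis=1, keepdims=True)
    sim = (z_v @ z_s.T) / temperature            # (N, N) similarity logits
    # Row-wise log-softmax (numerically stabilized); diagonal = matched pairs
    sim_max = sim.max(axis=1, keepdims=True)
    log_softmax = sim - sim_max - np.log(
        np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    n = z_v.shape[0]
    # I(z_v; z_s) >= log N + E[log p(correct match)]; the bound saturates
    # at log N, so it is only tight for large batch sizes
    return np.log(n) + np.diag(log_softmax).mean()

# Perfectly aligned, mutually distinct codes push the bound close to
# its ceiling of log N (about 1.386 for N = 4)
matched = np.eye(4)
print(infonce_mi_bound(matched, matched))
```

Maximizing this quantity with respect to the two encoders pulls matched visual/semantic pairs together in the shared latent space while pushing mismatched pairs apart, which is one standard way to realize the distribution alignment the abstract describes.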