DILF: Differentiable rendering-based multi-view Image–Language Fusion for zero-shot 3D shape understanding

Zero-shot 3D shape understanding aims to recognize “unseen” 3D categories that are not present in training data. Recently, Contrastive Language–Image Pre-training (CLIP) has shown promising open-world performance in zero-shot 3D shape understanding tasks by information fusion among language and 3D m...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Information fusion 2024-02, Vol.102, p.102033, Article 102033
Hauptverfasser: Ning, Xin, Yu, Zaiyang, Li, Lusi, Li, Weijun, Tiwari, Prayag
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Zero-shot 3D shape understanding aims to recognize “unseen” 3D categories that are not present in training data. Recently, Contrastive Language–Image Pre-training (CLIP) has shown promising open-world performance in zero-shot 3D shape understanding tasks by information fusion among language and 3D modality. It first renders 3D objects into multiple 2D image views and then learns to understand the semantic relationships between the textual descriptions and images, enabling the model to generalize to new and unseen categories. However, existing studies in zero-shot 3D shape understanding rely on predefined rendering parameters, resulting in repetitive, redundant, and low-quality views. This limitation hinders the model’s ability to fully comprehend 3D shapes and adversely impacts the text–image fusion in a shared latent space. To this end, we propose a novel approach called Differentiable rendering-based multi-view Image–Language Fusion (DILF) for zero-shot 3D shape understanding. Specifically, DILF leverages large-scale language models (LLMs) to generate textual prompts enriched with 3D semantics and designs a differentiable renderer with learnable rendering parameters to produce representative multi-view images. These rendering parameters can be iteratively updated using a text–image fusion loss, which aids in parameters’ regression, allowing the model to determine the optimal viewpoint positions for each 3D object. Then a group-view mechanism is introduced to model interdependencies across views, enabling efficient information fusion to achieve a more comprehensive 3D shape understanding. Experimental results can demonstrate that DILF outperforms state-of-the-art methods for zero-shot 3D classification while maintaining competitive performance for standard 3D classification. The code is available at https://github.com/yuzaiyang123/DILP. •A differentiable renderer fuses explicit text guidance into rendering process to produce informative multi-view images.•We propose the group-view mechanism and LLM-assisted textual feature learning, enabling efficient text–image fusion.•It achieves state-of-the-art for zero-shot 3D classification, competitive in standard 3D classification.
ISSN:1566-2535
1872-6305
1872-6305
DOI:10.1016/j.inffus.2023.102033