iMVS: Integrating multi-view information on multiple scales for 3D object recognition

Bibliographic details
Published in: Journal of Visual Communication and Image Representation, 2024-05, Vol. 101, Article 104175
Authors: Jiang, Jiaqin; Liu, Zhao; Li, Jie; Tu, Jingmin; Li, Li; Yao, Jian
Format: Article
Language: English
Online access: Full text
Abstract: 3D object recognition is a fundamental task in 3D computer vision. View-based methods have received considerable attention due to their high efficiency and superior performance. To better capture the long-range dependencies among multi-view images, the Transformer has recently been introduced into view-based 3D object recognition and has achieved excellent performance. However, existing Transformer-based methods do not sufficiently utilize the information shared among views at multiple scales. To address this limitation, we propose a 3D object recognition method named iMVS that integrates Multi-View information on multiple Scales. Specifically, for the single-view image and features at each scale, we adopt a hybrid feature extraction module consisting of a CNN and a Transformer to jointly capture local and non-local information. For the extracted multi-view features at each scale, we develop a feature transfer module with a view Transformer block that transfers information across views. By alternating single-view feature extraction and multi-view feature transfer over multiple scales, information is thoroughly exchanged among the views. Subsequently, the multi-scale features enriched with multi-view information are fed into our feature aggregation module to generate a category-specific descriptor, where a channel Transformer block makes the descriptor more expressive. With these designs, our method fully exploits the information embedded within multi-view images. Experimental results on ModelNet40, ModelNet10, and the real-world dataset MVP-N demonstrate the superior performance of our method.

Highlights:
• The view-based 3D object recognition method makes full use of multi-view information.
• The transfer module achieves multi-view feature interaction on multiple scales.
• The channel Transformer-based module effectively aggregates multi-scale features.
• Experimental results demonstrate the effectiveness of our method, iMVS.
ISSN: 1047-3203, 1095-9076
DOI: 10.1016/j.jvcir.2024.104175
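
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of an iMVS-style network: per-view hybrid CNN+Transformer feature extraction at each scale, a view Transformer block that exchanges information across views at that scale, and a channel-attention aggregator over the concatenated multi-scale descriptors. All class names, dimensions, pooling choices, and attention layouts here are illustrative assumptions, not the paper's actual architecture.

# A minimal sketch of an iMVS-style pipeline, assuming the three stages named
# in the abstract. All module names, dimensions, and the exact attention
# layouts are illustrative assumptions; the published architecture may differ.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    """Per-view feature extraction at one scale: a CNN stage for local
    detail followed by a Transformer encoder layer for non-local context."""

    def __init__(self, in_ch: int, out_ch: int) -> None:
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.TransformerEncoderLayer(
            d_model=out_ch, nhead=4, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.cnn(x)                           # (B*V, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)     # (B*V, H*W, C) spatial tokens
        tokens = self.attn(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class ViewTransfer(nn.Module):
    """Cross-view information transfer at one scale: self-attention over the
    V view tokens obtained by pooling each view's feature map."""

    def __init__(self, ch: int) -> None:
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(d_model=ch, nhead=4, batch_first=True)

    def forward(self, x: torch.Tensor, views: int) -> torch.Tensor:
        bv, c, h, w = x.shape
        tokens = x.mean(dim=(2, 3)).reshape(bv // views, views, c)  # one token per view
        tokens = self.attn(tokens)                # views exchange information
        # Broadcast the refined view tokens back onto the feature maps.
        return x + tokens.reshape(bv, c, 1, 1)


class ChannelAggregator(nn.Module):
    """Fuse multi-scale descriptors with attention over the channel axis
    (channels as tokens), a stand-in for the paper's channel Transformer."""

    def __init__(self, chans: list[int], num_classes: int) -> None:
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=1, num_heads=1, batch_first=True)
        self.head = nn.Linear(sum(chans), num_classes)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        desc = torch.cat(feats, dim=1)            # (B, sum(chans))
        tokens = desc.unsqueeze(-1)               # (B, D, 1): one token per channel
        refined, _ = self.attn(tokens, tokens, tokens)
        return self.head((tokens + refined).squeeze(-1))


class IMVSSketch(nn.Module):
    def __init__(self, num_classes: int = 40, views: int = 12) -> None:
        super().__init__()
        self.views = views
        chans = [32, 64, 128]
        self.extract = nn.ModuleList(
            HybridBlock(i, o) for i, o in zip([3] + chans[:-1], chans)
        )
        self.transfer = nn.ModuleList(ViewTransfer(c) for c in chans)
        self.aggregate = ChannelAggregator(chans, num_classes)

    def forward(self, imgs: torch.Tensor) -> torch.Tensor:
        # imgs: (B, V, 3, H, W) multi-view renderings of one object.
        b, v = imgs.shape[:2]
        x = imgs.flatten(0, 1)
        scale_feats = []
        for extract, transfer in zip(self.extract, self.transfer):
            x = transfer(extract(x), v)           # extract, then exchange across views
            # Max-pool over views to get this scale's descriptor.
            scale_feats.append(x.mean(dim=(2, 3)).reshape(b, v, -1).max(dim=1).values)
        return self.aggregate(scale_feats)        # (B, num_classes)


if __name__ == "__main__":
    model = IMVSSketch(num_classes=40, views=12)
    logits = model(torch.randn(2, 12, 3, 64, 64))
    print(logits.shape)  # torch.Size([2, 40])

The sketch only demonstrates the data flow, (B, V, 3, H, W) renderings through alternating per-view extraction and cross-view transfer into a single multi-scale descriptor; the published model presumably uses a deeper backbone and more elaborate Transformer blocks than these toy stages.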