iMVS: Integrating multi-view information on multiple scales for 3D object recognition
Published in: Journal of Visual Communication and Image Representation, 2024-05, Vol. 101, p. 104175, Article 104175
Main authors: , , , , ,
Format: Article
Language: English
Online access: Full text
Abstract: 3D object recognition is a fundamental task in 3D computer vision. View-based methods have received considerable attention due to their high efficiency and superior performance. To better capture the long-range dependencies among multi-view images, Transformers have recently been introduced into view-based 3D object recognition and have achieved excellent performance. However, existing Transformer-based methods do not sufficiently exploit the information shared among views at multiple scales. To address this limitation, we propose a 3D object recognition method named iMVS that integrates Multi-View information on multiple Scales. Specifically, for the single-view image/features at each scale, we adopt a hybrid feature extraction module consisting of a CNN and a Transformer to jointly capture local and non-local information. For the extracted multi-view features at each scale, we develop a feature transfer module with a view Transformer block that transfers information across views. By alternating single-view feature extraction and multi-view feature transfer over multiple scales, information from all views is thoroughly exchanged. The multi-scale features carrying multi-view information are then fed into our feature aggregation module to generate a category-specific descriptor, where a channel Transformer block makes the descriptor more expressive. With these designs, our method can fully exploit the information embedded within multi-view images. Experimental results on ModelNet40, ModelNet10, and the real-world dataset MVP-N demonstrate the superior performance of our method.
Highlights:
• The view-based 3D object recognition method makes full use of multi-view information
• The transfer module achieves multi-view feature interaction on multiple scales
• The channel Transformer-based module effectively aggregates multi-scale features
• Experimental results demonstrate the effectiveness of our method iMVS
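To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch. All module names (HybridBlock, ViewTransfer, ChannelAggregate, IMVSSketch), dimensions, pooling choices, and block layouts are illustrative assumptions, not the authors' implementation; the sketch only mirrors the described flow of per-scale hybrid extraction, cross-view transfer, and multi-scale aggregation.

```python
# Hypothetical sketch of the iMVS pipeline; architecture details are assumed.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    """Single-view extractor at one scale: a strided conv captures local
    structure, then self-attention over spatial tokens captures non-local
    structure (assumed layout for the paper's hybrid CNN+Transformer module)."""

    def __init__(self, in_ch, out_ch, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.MultiheadAttention(out_ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):                        # x: (B*V, C_in, H, W)
        x = self.conv(x)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)         # (B*V, H*W, C) spatial tokens
        t = self.norm(t + self.attn(t, t, t, need_weights=False)[0])
        return t.transpose(1, 2).reshape(b, c, h, w)


class ViewTransfer(nn.Module):
    """Cross-view transfer at one scale: pool each view to one token and
    attend across the V views of the same object (assumed form of the
    view Transformer block)."""

    def __init__(self, ch, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)

    def forward(self, x, views):                 # x: (B*V, C, H, W)
        bv, c, _, _ = x.shape
        tok = x.mean(dim=(2, 3)).reshape(bv // views, views, c)
        tok = self.norm(tok + self.attn(tok, tok, tok, need_weights=False)[0])
        return x + tok.reshape(bv, c, 1, 1)      # broadcast view-mixed context


class ChannelAggregate(nn.Module):
    """Aggregation: project each scale's feature to a shared width, view the
    result as (channel tokens x scales), and attend across channels -- a
    guess at the role of the paper's channel Transformer block."""

    def __init__(self, chans, dim, scales):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(c, dim) for c in chans)
        self.attn = nn.MultiheadAttention(scales, 1, batch_first=True)
        self.norm = nn.LayerNorm(scales)

    def forward(self, feats):                    # feats: list of (B, C_s)
        tok = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=2)
        tok = self.norm(tok + self.attn(tok, tok, tok, need_weights=False)[0])
        return tok.mean(dim=2)                   # (B, dim) descriptor


class IMVSSketch(nn.Module):
    """End-to-end sketch: alternate single-view extraction and cross-view
    transfer over the scales, then aggregate the multi-scale features."""

    def __init__(self, num_classes=40, views=12, chans=(32, 64, 128), dim=128):
        super().__init__()
        ins = (3,) + chans[:-1]
        self.extract = nn.ModuleList(HybridBlock(i, o) for i, o in zip(ins, chans))
        self.transfer = nn.ModuleList(ViewTransfer(c) for c in chans)
        self.aggregate = ChannelAggregate(chans, dim, len(chans))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, V, 3, H, W)
        b, v = x.shape[:2]
        x = x.flatten(0, 1)
        feats = []
        for extract, transfer in zip(self.extract, self.transfer):
            x = transfer(extract(x), v)          # extract, then mix across views
            feats.append(x.mean(dim=(2, 3)).reshape(b, v, -1).mean(dim=1))
        return self.head(self.aggregate(feats))


if __name__ == "__main__":
    model = IMVSSketch()                          # 40 classes, e.g. ModelNet40
    logits = model(torch.randn(2, 12, 3, 64, 64))
    print(logits.shape)                           # torch.Size([2, 40])
```

Pooling each view to a single token before the cross-view attention keeps that step cheap; the paper's actual block may instead attend over full spatial token sets across views.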
ISSN: 1047-3203, 1095-9076
DOI: 10.1016/j.jvcir.2024.104175