Hybrid-CT: a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for small object classification

Bibliographic details
Published in: Signal, Image and Video Processing, 2025-01, Vol. 19 (2), Article 133
Main authors: Bayoudh, Khaled; Mtibaa, Abdellatif
Format: Article
Language: English
Online access: Full text
Description
Abstract: In recent years, convolutional neural networks (CNNs) have proven effective in many challenging computer vision tasks, including small object classification. However, according to the recent literature, this task relies mainly on 2D CNNs, and the small size of object instances makes their recognition difficult. Since 3D CNNs are computationally expensive and time-consuming to train, they are poorly suited to settings that require a trade-off between accuracy and efficiency. Moreover, following the great success of Transformers in natural language processing (NLP), the spatial Transformer has emerged as a robust feature transformer and has recently been applied successfully to computer vision tasks, including image classification. By incorporating attention mechanisms, Transformers achieve excellent performance on many NLP and computer vision tasks and learn a contextual encoding of the input patches. However, their complexity generally grows with the dimension of the input feature space. In this paper, we propose a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for better performance on low-resolution datasets. First, combining a pre-trained deep CNN with a 3D CNN significantly reduces complexity and yields an accurate learning algorithm. Second, a pre-trained deep CNN model is used as a robust feature extractor and combined with a spatial Transformer to improve the representational power of the model and to exploit the powerful global modeling capabilities of Transformers. Finally, spatial attention and channel attention are adaptively fused, attending to all components of the input space to capture local and global spatial correlations over non-overlapping regions of the input representation. Experimental results show that the proposed framework achieves significant gains in both efficiency and accuracy.
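To make the pipeline outlined in the abstract concrete, the following is a minimal PyTorch sketch of the general idea: a pre-trained 2D CNN as feature extractor, adaptive fusion of channel and spatial attention over its feature maps, and a Transformer encoder over the resulting non-overlapping spatial tokens. All specifics here are assumptions, not the authors' exact design: the ResNet-18 backbone, the CBAM-style attention block, the layer sizes, and the module names (SpatialChannelAttention, HybridCTSketch) are illustrative, and the paper's 3D CNN branch is omitted for brevity.

```python
# Hedged sketch of a hybrid CNN-Transformer with fused attention.
# Backbone, sizes, and module names are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models


class SpatialChannelAttention(nn.Module):
    """Adaptively fuses channel and spatial attention (CBAM-style;
    the exact fusion used in the paper may differ)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight channels.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: squeeze channels, re-weight locations.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)            # channel attention
        avg = x.mean(dim=1, keepdim=True)      # per-location statistics
        mx, _ = x.max(dim=1, keepdim=True)
        return x * self.spatial_conv(torch.cat([avg, mx], dim=1))


class HybridCTSketch(nn.Module):
    """Pre-trained 2D CNN features -> attention fusion -> Transformer
    encoder over non-overlapping spatial tokens -> classifier head."""

    def __init__(self, num_classes: int = 10, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Keep all layers up to (not including) global pool and fc head.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.attention = SpatialChannelAttention(embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.attention(self.features(x))   # (B, C, H, W)
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, C) spatial tokens
        encoded = self.transformer(tokens)
        return self.head(encoded.mean(dim=1))  # mean-pool token embeddings


if __name__ == "__main__":
    model = HybridCTSketch(num_classes=10)
    logits = model(torch.randn(2, 3, 64, 64))  # low-resolution input
    print(logits.shape)                        # torch.Size([2, 10])
```

In a faithful implementation of the paper, features from the 3D CNN branch would be fused with these 2D features before or alongside the Transformer stage; this sketch illustrates only the 2D path and the attention-plus-Transformer composition.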
ISSN: 1863-1703, 1863-1711
DOI: 10.1007/s11760-024-03696-y