MFVT: Multilevel Feature Fusion Vision Transformer and RAMix Data Augmentation for Fine-Grained Visual Categorization

Bibliographic Details
Published in: Electronics (Basel), 2022-11, Vol. 11 (21), p. 3552
Authors: Lv, Xinyao; Xia, Hao; Li, Na; Li, Xudong; Lan, Ruoming
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: The introduction and application of the Vision Transformer (ViT) have promoted the development of fine-grained visual categorization (FGVC). However, directly applying ViT to FGVC tasks raises problems: ViT classifies using only the class token of the last layer, ignoring the local and low-level features necessary for FGVC. We propose a ViT-based multilevel feature fusion transformer (MFVT) for FGVC tasks. Following ViT, the backbone adopts 12 Transformer blocks, divided into four stages, with multilevel feature fusion (MFF) added between Transformer layers. We also design RAMix, a CutMix-based data augmentation strategy that applies resizing to the crop-and-pasted images and assigns labels based on attention. Experiments on the CUB-200-2011, Stanford Dogs, and iNaturalist 2017 datasets give competitive results, notably an accuracy of 72.6% on the challenging iNaturalist 2017.
ISSN: 2079-9292
DOI: 10.3390/electronics11213552
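Below is a minimal, hypothetical PyTorch sketch of the RAMix idea as described in the abstract, not the authors' implementation: the paired image is resized into a randomly sampled box of the target image (rather than cropping a same-sized patch, as in plain CutMix), and the two labels are then mixed. The function name ramix_batch, the Beta-distributed area parameter, and the area-ratio label weight are assumptions; the paper assigns the label weight from attention, which is only noted in a comment here.

# Hypothetical RAMix-style augmentation sketch (PyTorch), based only on the abstract.
import torch
import torch.nn.functional as F

def ramix_batch(images, labels, alpha=1.0):
    """images: (B, C, H, W) float tensor; labels: (B,) class indices."""
    B, _, H, W = images.shape
    perm = torch.randperm(B)                                  # pair each image with another
    lam = torch.distributions.Beta(alpha, alpha).sample().item()

    # Sample a box covering roughly a (1 - lam) fraction of the image area.
    cut_h = max(1, int(H * (1.0 - lam) ** 0.5))
    cut_w = max(1, int(W * (1.0 - lam) ** 0.5))
    cy = torch.randint(0, H - cut_h + 1, (1,)).item()
    cx = torch.randint(0, W - cut_w + 1, (1,)).item()

    # Resize the whole paired image into the box instead of cropping a patch from it.
    resized = F.interpolate(images[perm], size=(cut_h, cut_w),
                            mode="bilinear", align_corners=False)
    mixed = images.clone()
    mixed[:, :, cy:cy + cut_h, cx:cx + cut_w] = resized

    # Placeholder: weight labels by pasted area; MFVT's RAMix instead derives this
    # weight from the model's attention over the pasted region.
    lam_area = 1.0 - (cut_h * cut_w) / (H * W)
    return mixed, labels, labels[perm], lam_area

As in CutMix, the training loss for a mixed batch would combine both targets, e.g. lam_area * CE(logits, labels) + (1 - lam_area) * CE(logits, labels[perm]), with lam_area replaced by an attention-derived weight in the paper's formulation.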