Comparison between transformers and convolutional models for fine-grained classification of insects
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Keywords: | |
Abstract: | Fine-grained classification is challenging because discriminative
features are hard to find. The problem is exacerbated when identifying species
within the same taxonomic class, because species often share morphological
characteristics that make them difficult to differentiate. We consider the
taxonomic class Insecta. Identifying insects is essential for biodiversity
monitoring, as they are among the organisms at the base of many ecosystems.
Citizen science does brilliant work collecting images of insects in the wild,
enabling experts to create improved distribution maps in all countries.
Billions of images need to be classified automatically, and deep neural
networks are one of the main techniques explored for fine-grained tasks. The
field of deep learning is currently extremely productive, so how should one
choose which algorithm to use? We focus on the orders Odonata and Coleoptera
and propose an initial comparative study of the two best-known layer
structures for computer vision: transformer and convolutional layers. We
compare T2TViT, a fully transformer-based model; EfficientNet, a fully
convolutional model; and ViTAE, a hybrid. We analyse the three models under
identical conditions, evaluating performance per species, performance per
morph together with sex, inference time, and overall performance on unbalanced
datasets of smartphone images. Although all three families of models perform
well, our analysis shows that the hybrid model outperforms the fully
convolutional and fully transformer-based models on accuracy, while the fully
transformer-based model outperforms the others on inference speed. These
results show the transformer to be robust to a shortage of samples and faster
at inference time. |
---|---|
DOI: | 10.48550/arxiv.2307.11112 |
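
The abstract reports both per-species and overall performance on unbalanced datasets. The sketch below is not taken from the paper. It is a minimal illustration, with hypothetical labels and helper names, of why the two views can diverge: overall accuracy is dominated by frequent species, while per-class and macro-averaged accuracy weight every species equally.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {label: correct[label] / total[label] for label in total}


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly, regardless of class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies."""
    acc = per_class_accuracy(y_true, y_pred)
    return sum(acc.values()) / len(acc)


# Hypothetical toy data: 8 images of a frequent species, 2 of a rare one.
y_true = ["common"] * 8 + ["rare"] * 2
# Assumed predictions: one of the two rare images is misclassified.
y_pred = ["common"] * 8 + ["common", "rare"]

print(per_class_accuracy(y_true, y_pred))
print(overall_accuracy(y_true, y_pred))
print(macro_accuracy(y_true, y_pred))
```

In this toy case the overall accuracy stays high even though half of the rare-species images are misclassified, which is the kind of effect a per-species breakdown is meant to expose.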