Efficient Curation of Invertebrate Image Datasets Using Feature Embeddings and Automatic Size Comparison
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | The number of image datasets collected for environmental monitoring
has increased in recent years as computer-vision-assisted methods have gained
interest. Computer vision applications rely on high-quality datasets, making
data curation important. However, data curation is often done ad hoc, and the
methods used are rarely published. We present a method for curating large-scale
image datasets of invertebrates that contain multiple images of the same taxa
and/or specimens and have relatively uniform image backgrounds. Our
approach is based on extracting feature embeddings with pretrained deep neural
networks and using these embeddings to find the most visually distinct images
by comparing each embedding to the group prototype embedding. We also show that
a simple area-based size comparison approach can find many common
erroneous images, such as images containing detached body parts and
misclassified samples. In addition to the method, we propose using novel
metrics for evaluating human-in-the-loop outlier detection methods. The
implementations of the proposed curation methods, as well as a benchmark
dataset containing annotated erroneous images, are publicly available at
https://github.com/mikkoim/taxonomist-studio. |
---|---|
DOI: | 10.48550/arxiv.2412.15844 |
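
The two ideas named in the summary (ranking images by distance to a group prototype embedding, and comparing specimen area within a group) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of the mean embedding as the prototype, cosine distance as the distinctness measure, the median-ratio rule, and the `max_ratio` threshold are all assumptions made for illustration. The embeddings and foreground areas are assumed to have been computed elsewhere.

```python
import numpy as np


def rank_by_prototype_distance(embeddings: np.ndarray) -> np.ndarray:
    """Return image indices ordered from most to least visually distinct.

    embeddings: (n_images, dim) array for ONE group (e.g. one taxon),
    with no all-zero rows.
    Assumption: the group prototype is the mean of the L2-normalised
    embeddings, and distinctness is the cosine distance to that prototype.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    prototype = normed.mean(axis=0)
    prototype = prototype / np.linalg.norm(prototype)
    distances = 1.0 - normed @ prototype  # cosine distance per image
    return np.argsort(-distances)  # largest distance first


def flag_size_outliers(areas: np.ndarray, max_ratio: float = 3.0) -> np.ndarray:
    """Return a boolean mask of images whose area is far from the group median.

    areas: (n_images,) positive foreground areas (e.g. pixel counts) for ONE group.
    Assumption: an image is suspicious when its area is more than
    `max_ratio` times larger or smaller than the group median.
    """
    median = np.median(areas)
    larger = np.maximum(areas, median)
    smaller = np.maximum(np.minimum(areas, median), 1e-12)  # avoid division by zero
    return (larger / smaller) > max_ratio
```

In a human-in-the-loop setting, a reviewer would inspect images in the order returned by `rank_by_prototype_distance` and check the images flagged by `flag_size_outliers`; a much smaller area than the group median could, for example, indicate a detached body part.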