VLMine: Long-Tail Data Mining with Vision Language Models
Saved in:
Main authors: , , , , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Ensuring robust performance on long-tail examples is an important problem for
many real-world applications of machine learning, such as autonomous driving.
This work focuses on the problem of identifying rare examples within a corpus
of unlabeled data. We propose a simple and scalable data mining approach that
leverages the knowledge contained within a large vision language model (VLM).
Our approach utilizes a VLM to summarize the content of an image into a set of
keywords, and we identify rare examples based on keyword frequency. We find
that the VLM offers a distinct signal for identifying long-tail examples when
compared to conventional methods based on model uncertainty. Therefore, we
propose a simple and general approach for integrating signals from multiple
mining algorithms. We evaluate the proposed method on two diverse tasks: 2D
image classification, in which inter-class variation is the primary source of
data diversity, and on 3D object detection, where intra-class variation is the
main concern. Furthermore, through the detection task, we demonstrate that the
knowledge extracted from 2D images is transferable to the 3D domain. Our
experiments consistently show large improvements (between 10% and 50%) over
the baseline techniques on several representative benchmarks: ImageNet-LT,
Places-LT, and the Waymo Open Dataset.
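The keyword-based mining described in the abstract can be sketched as follows. The specific scoring rule (an example is as rare as its least frequent keyword) and the rank-averaging fusion of two mining signals are illustrative assumptions, not the paper's exact method, and the VLM summarization step that produces the keywords is assumed to happen elsewhere:

```python
from collections import Counter

def keyword_rarity_ranking(keyword_sets):
    """Rank examples from rarest to most common based on VLM keywords.

    keyword_sets: one list of keywords per image, assumed to come from a
    separate VLM summarization step (not shown). The scoring rule here
    (minimum corpus frequency over an example's keywords) is illustrative.
    """
    # Corpus-wide keyword frequencies; set() deduplicates within an example.
    freq = Counter(kw for kws in keyword_sets for kw in set(kws))

    def score(kws):
        # Lower score = rarer example; empty keyword lists sort last.
        return min((freq[kw] for kw in set(kws)), default=float("inf"))

    return sorted(range(len(keyword_sets)), key=lambda i: score(keyword_sets[i]))

def fuse_rankings(ranking_a, ranking_b):
    """Combine two mining signals (e.g. keyword rarity and model
    uncertainty) by averaging their ranks -- an illustrative fusion rule."""
    rank_a = {idx: r for r, idx in enumerate(ranking_a)}
    rank_b = {idx: r for r, idx in enumerate(ranking_b)}
    return sorted(rank_a, key=lambda i: rank_a[i] + rank_b[i])
```

For example, in a corpus where most images yield keywords like "car" and "road", an image summarized with a one-off keyword such as "giraffe" receives the lowest score and surfaces first in the ranking.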
DOI: 10.48550/arxiv.2409.15486