VLMine: Long-Tail Data Mining with Vision Language Models

Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM).

Detailed description

Saved in:
Bibliographic details
Main authors: Ye, Mao, Meyer, Gregory P, Zhang, Zaiwei, Park, Dennis, Mustikovela, Siva Karthik, Chai, Yuning, Wolff, Eric M
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator Ye, Mao
Meyer, Gregory P
Zhang, Zaiwei
Park, Dennis
Mustikovela, Siva Karthik
Chai, Yuning
Wolff, Eric M
description Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approach utilizes a VLM to summarize the content of an image into a set of keywords, and we identify rare examples based on keyword frequency. We find that the VLM offers a distinct signal for identifying long-tail examples when compared to conventional methods based on model uncertainty. Therefore, we propose a simple and general approach for integrating signals from multiple mining algorithms. We evaluate the proposed method on two diverse tasks: 2D image classification, in which inter-class variation is the primary source of data diversity, and on 3D object detection, where intra-class variation is the main concern. Furthermore, through the detection task, we demonstrate that the knowledge extracted from 2D images is transferable to the 3D domain. Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques on several representative benchmarks: ImageNet-LT, Places-LT, and the Waymo Open Dataset.
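The description above outlines the mining recipe: a VLM summarizes each image into keywords, rarity is scored from keyword frequency, and the VLM signal is fused with other mining signals. A minimal sketch of that recipe, assuming a least-frequent-keyword rarity score and a rank-averaging fusion rule (neither scoring detail is specified in the abstract, so both are illustrative assumptions):

```python
from collections import Counter

def keyword_rarity_scores(image_keywords):
    """Score each unlabeled image by how rare its VLM keywords are.

    image_keywords: one list of VLM-generated keywords per image.
    Higher score = rarer content. This sketch assumes an image is
    as rare as its least-frequent keyword; the paper's exact rule
    may differ.
    """
    freq = Counter(kw for kws in image_keywords for kw in kws)
    total = sum(freq.values())
    scores = []
    for kws in image_keywords:
        # Rarity driven by the least-frequent keyword in the image
        rarest = min((freq[kw] / total for kw in kws), default=1.0)
        scores.append(1.0 - rarest)
    return scores

def combine_signals(*signals):
    """Fuse several mining signals (e.g. VLM rarity and model
    uncertainty) by averaging each example's rank per signal —
    a hypothetical fusion rule, not the paper's exact scheme."""
    n = len(signals[0])
    combined = [0.0] * n
    for s in signals:
        order = sorted(range(n), key=lambda i: s[i])  # ascending value
        for rank, i in enumerate(order):
            combined[i] += rank / len(signals)
    return combined

# Toy corpus: keywords a VLM might emit for three driving scenes
corpus = [
    ["car", "road", "daytime"],
    ["car", "road", "daytime"],
    ["overturned truck", "road", "daytime"],  # long-tail event
]
scores = keyword_rarity_scores(corpus)
print(scores.index(max(scores)))  # → 2: the long-tail scene ranks highest
```

Because fusion works on ranks rather than raw values, signals on different scales (keyword frequencies vs. uncertainty estimates) can be combined without calibration.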
doi_str_mv 10.48550/arxiv.2409.15486
format Article
creationdate 2024-09-23
oa free_for_read
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2409.15486
language eng
recordid cdi_arxiv_primary_2409_15486
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition
title VLMine: Long-Tail Data Mining with Vision Language Models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T16%3A57%3A01IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=VLMine:%20Long-Tail%20Data%20Mining%20with%20Vision%20Language%20Models&rft.au=Ye,%20Mao&rft.date=2024-09-23&rft_id=info:doi/10.48550/arxiv.2409.15486&rft_dat=%3Carxiv_GOX%3E2409_15486%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true