Prompting Scientific Names for Zero-Shot Species Recognition

Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, for whi...

Bibliographic details
Main authors:	Parashar, Shubham; Lin, Zhiqiu; Li, Yanan; Kong, Shu
Format:	Article
Language:	eng
Subjects:	Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Parashar, Shubham; Lin, Zhiqiu; Li, Yanan; Kong, Shu
description	Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, whose scientific names are written in Latin or Greek. Indeed, CLIP performs poorly for zero-shot species recognition with prompts that use scientific names, e.g., "a photo of Lepus Timidus" (a scientific name in Latin), because these names are usually not included in CLIP's training set. To improve performance, prior works propose using large language models (LLMs) to generate descriptions (e.g., of species color and shape) and adding them to the prompts. We find that these descriptions bring only marginal gains. In contrast, we translate scientific names (e.g., Lepus Timidus) into common English names (e.g., mountain hare) and use the common names in the prompts. We find that common names are more likely to be included in CLIP's training set, and prompting with them achieves 2 to 5 times higher accuracy on benchmark datasets for fine-grained species recognition.
doi_str_mv	10.48550/arxiv.2310.09929
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2310_09929</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2310_09929</sourcerecordid><originalsourceid>FETCH-LOGICAL-a679-73186caa9f090a6ce4a747e192f9b5037bc935ced6a9ba1fbedc4275a31c61ab3</originalsourceid><addsrcrecordid>eNotj81KxDAUhbNxIaMP4Mq8QMekaZK54EYG_2BQsbNyU27u3IwB25S0iL694-jqwOHwcT4hLrRaNitr1RWWr_S5rM2hUAA1nIrrl5L7cU7DXraUeJhTTCSfsOdJxlzkG5dcte95lu3Ih8EkX5nyfkhzysOZOIn4MfH5fy7E9u52u36oNs_3j-ubTYXOQ-WNXjlChKhAoSNu0DeeNdQRglXGBwJjiXcOIaCOgXfU1N6i0eQ0BrMQl3_Y4_1uLKnH8t39anRHDfMD8cRC-w</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Prompting Scientific Names for Zero-Shot Species Recognition</title><source>arXiv.org</source><creator>Parashar, Shubham ; Lin, Zhiqiu ; Li, Yanan ; Kong, Shu</creator><creatorcontrib>Parashar, Shubham ; Lin, Zhiqiu ; Li, Yanan ; Kong, Shu</creatorcontrib><description>Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, for which their scientific names are written in Latin or Greek. Indeed, CLIP performs poorly for zero-shot species recognition with prompts that use scientific names, e.g., "a photo of Lepus Timidus" (which is a scientific name in Latin). Because these names are usually not included in CLIP's training set. To improve performance, prior works propose to use large-language models (LLMs) to generate descriptions (e.g., of species color and shape) and additionally use them in prompts. We find that they bring only marginal gains. 
Differently, we are motivated to translate scientific names (e.g., Lepus Timidus) to common English names (e.g., mountain hare) and use such in the prompts. We find that common names are more likely to be included in CLIP's training set, and prompting them achieves 2$\sim$5 times higher accuracy on benchmarking datasets of fine-grained species recognition.</description><identifier>DOI: 10.48550/arxiv.2310.09929</identifier><language>eng</language><subject>Computer Science - Computation and Language ; Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2023-10</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2310.09929$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2310.09929$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Parashar, Shubham</creatorcontrib><creatorcontrib>Lin, Zhiqiu</creatorcontrib><creatorcontrib>Li, Yanan</creatorcontrib><creatorcontrib>Kong, Shu</creatorcontrib><title>Prompting Scientific Names for Zero-Shot Species Recognition</title><description>Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, for which their scientific names are written in Latin or Greek. 
Indeed, CLIP performs poorly for zero-shot species recognition with prompts that use scientific names, e.g., "a photo of Lepus Timidus" (which is a scientific name in Latin). Because these names are usually not included in CLIP's training set. To improve performance, prior works propose to use large-language models (LLMs) to generate descriptions (e.g., of species color and shape) and additionally use them in prompts. We find that they bring only marginal gains. Differently, we are motivated to translate scientific names (e.g., Lepus Timidus) to common English names (e.g., mountain hare) and use such in the prompts. We find that common names are more likely to be included in CLIP's training set, and prompting them achieves 2$\sim$5 times higher accuracy on benchmarking datasets of fine-grained species recognition.</description><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj81KxDAUhbNxIaMP4Mq8QMekaZK54EYG_2BQsbNyU27u3IwB25S0iL694-jqwOHwcT4hLrRaNitr1RWWr_S5rM2hUAA1nIrrl5L7cU7DXraUeJhTTCSfsOdJxlzkG5dcte95lu3Ih8EkX5nyfkhzysOZOIn4MfH5fy7E9u52u36oNs_3j-ubTYXOQ-WNXjlChKhAoSNu0DeeNdQRglXGBwJjiXcOIaCOgXfU1N6i0eQ0BrMQl3_Y4_1uLKnH8t39anRHDfMD8cRC-w</recordid><startdate>20231015</startdate><enddate>20231015</enddate><creator>Parashar, Shubham</creator><creator>Lin, Zhiqiu</creator><creator>Li, Yanan</creator><creator>Kong, Shu</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20231015</creationdate><title>Prompting Scientific Names for Zero-Shot Species Recognition</title><author>Parashar, Shubham ; Lin, Zhiqiu ; Li, Yanan ; Kong, 
Shu</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a679-73186caa9f090a6ce4a747e192f9b5037bc935ced6a9ba1fbedc4275a31c61ab3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Parashar, Shubham</creatorcontrib><creatorcontrib>Lin, Zhiqiu</creatorcontrib><creatorcontrib>Li, Yanan</creatorcontrib><creatorcontrib>Kong, Shu</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Parashar, Shubham</au><au>Lin, Zhiqiu</au><au>Li, Yanan</au><au>Kong, Shu</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Prompting Scientific Names for Zero-Shot Species Recognition</atitle><date>2023-10-15</date><risdate>2023</risdate><abstract>Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, for which their scientific names are written in Latin or Greek. Indeed, CLIP performs poorly for zero-shot species recognition with prompts that use scientific names, e.g., "a photo of Lepus Timidus" (which is a scientific name in Latin). Because these names are usually not included in CLIP's training set. To improve performance, prior works propose to use large-language models (LLMs) to generate descriptions (e.g., of species color and shape) and additionally use them in prompts. We find that they bring only marginal gains. 
Differently, we are motivated to translate scientific names (e.g., Lepus Timidus) to common English names (e.g., mountain hare) and use such in the prompts. We find that common names are more likely to be included in CLIP's training set, and prompting them achieves 2$\sim$5 times higher accuracy on benchmarking datasets of fine-grained species recognition.</abstract><doi>10.48550/arxiv.2310.09929</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2310.09929
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2310_09929
source	arXiv.org
subjects	Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition
title	Prompting Scientific Names for Zero-Shot Species Recognition
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-20T20%3A48%3A39IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Prompting%20Scientific%20Names%20for%20Zero-Shot%20Species%20Recognition&rft.au=Parashar,%20Shubham&rft.date=2023-10-15&rft_id=info:doi/10.48550/arxiv.2310.09929&rft_dat=%3Carxiv_GOX%3E2310_09929%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
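
The prompting idea summarized in the description field (replace a scientific name with its common English name before filling a prompt template) can be illustrated with a minimal sketch. This is not the authors' code. The function name `build_prompts`, the table `COMMON_NAMES`, and the fallback to the scientific name when no translation is known are assumptions made for illustration; only the pair "Lepus Timidus" and "mountain hare" and the template "a photo of ..." come from the abstract. The resulting strings would then be passed to a CLIP text encoder, which is not shown here.

```python
from typing import Dict, List

# Hypothetical lookup table. Only this one pair appears in the abstract;
# a real table would cover every class in the benchmark.
COMMON_NAMES: Dict[str, str] = {
    "Lepus Timidus": "mountain hare",
}


def build_prompts(
    scientific_names: List[str],
    common_names: Dict[str, str],
    template: str = "a photo of {}",
) -> List[str]:
    """Return one text prompt per class.

    Each scientific name is swapped for its common English name when the
    lookup table has one. Falling back to the scientific name is an
    assumption of this sketch, not something the abstract specifies.
    """
    prompts = []
    for name in scientific_names:
        label = common_names.get(name, name)
        prompts.append(template.format(label))
    return prompts


if __name__ == "__main__":
    # "Vulpes lagopus" is deliberately missing from the table to show
    # the fallback path.
    for prompt in build_prompts(["Lepus Timidus", "Vulpes lagopus"], COMMON_NAMES):
        print(prompt)
```

Keeping the lookup table separate from the template makes it straightforward to compare scientific-name prompts against common-name prompts by passing an empty dictionary for the former.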