Prompting Scientific Names for Zero-Shot Species Recognition

Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, for whi...

Bibliographic Details
Main authors: Parashar, Shubham; Lin, Zhiqiu; Li, Yanan; Kong, Shu
Format: Article
Language: eng
creator Parashar, Shubham
Lin, Zhiqiu
Li, Yanan
Kong, Shu
description Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, whose scientific names are written in Latin or Greek. Indeed, CLIP performs poorly for zero-shot species recognition with prompts that use scientific names, e.g., "a photo of Lepus Timidus" (a scientific name in Latin), because these names are usually not included in CLIP's training set. To improve performance, prior works propose using large language models (LLMs) to generate descriptions (e.g., of species color and shape) and adding them to the prompts. We find that these descriptions bring only marginal gains. In contrast, we translate scientific names (e.g., Lepus Timidus) to common English names (e.g., mountain hare) and use the common names in the prompts. We find that common names are more likely to be included in CLIP's training set, and prompting with them achieves 2 to 5 times higher accuracy on benchmark datasets for fine-grained species recognition.
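The name-translation step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the lookup table `COMMON_NAMES`, the helper `build_prompt`, and the fallback behaviour are hypothetical and are not taken from the paper, which the record does not describe at this level of detail. Only the "a photo of ..." template and the Lepus Timidus / mountain hare pair come from the abstract. The sketch stops at building prompt strings and does not call CLIP.

```python
# Hypothetical lookup table from scientific names to common English names.
# The single entry is the example given in the abstract; a real table
# would have to come from a taxonomy resource (not specified here).
COMMON_NAMES = {
    "lepus timidus": "mountain hare",
}


def build_prompt(scientific_name, common_names=COMMON_NAMES,
                 template="a photo of {}"):
    """Return a text prompt that uses the common name when one is known.

    Falls back to the scientific name if no translation is available
    (an assumed policy, chosen only to keep the sketch total).
    """
    key = scientific_name.strip().lower()  # case-insensitive lookup
    name = common_names.get(key, scientific_name)
    return template.format(name)


# Build one prompt per class; these strings would then be passed to a
# VLM text encoder for zero-shot classification.
prompts = [build_prompt(n) for n in ["Lepus Timidus", "Cardinalis cardinalis"]]
```

In this sketch the first class is translated to its common name, while the second, which is absent from the hypothetical table, keeps its scientific name.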
doi_str_mv 10.48550/arxiv.2310.09929
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2310_09929</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2310_09929</sourcerecordid><originalsourceid>FETCH-LOGICAL-a679-73186caa9f090a6ce4a747e192f9b5037bc935ced6a9ba1fbedc4275a31c61ab3</originalsourceid><addsrcrecordid>eNotj81KxDAUhbNxIaMP4Mq8QMekaZK54EYG_2BQsbNyU27u3IwB25S0iL694-jqwOHwcT4hLrRaNitr1RWWr_S5rM2hUAA1nIrrl5L7cU7DXraUeJhTTCSfsOdJxlzkG5dcte95lu3Ih8EkX5nyfkhzysOZOIn4MfH5fy7E9u52u36oNs_3j-ubTYXOQ-WNXjlChKhAoSNu0DeeNdQRglXGBwJjiXcOIaCOgXfU1N6i0eQ0BrMQl3_Y4_1uLKnH8t39anRHDfMD8cRC-w</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Prompting Scientific Names for Zero-Shot Species Recognition</title><source>arXiv.org</source><creator>Parashar, Shubham ; Lin, Zhiqiu ; Li, Yanan ; Kong, Shu</creator><creatorcontrib>Parashar, Shubham ; Lin, Zhiqiu ; Li, Yanan ; Kong, Shu</creatorcontrib><description>Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, for which their scientific names are written in Latin or Greek. Indeed, CLIP performs poorly for zero-shot species recognition with prompts that use scientific names, e.g., "a photo of Lepus Timidus" (which is a scientific name in Latin). Because these names are usually not included in CLIP's training set. To improve performance, prior works propose to use large-language models (LLMs) to generate descriptions (e.g., of species color and shape) and additionally use them in prompts. We find that they bring only marginal gains. 
Differently, we are motivated to translate scientific names (e.g., Lepus Timidus) to common English names (e.g., mountain hare) and use such in the prompts. We find that common names are more likely to be included in CLIP's training set, and prompting them achieves 2$\sim$5 times higher accuracy on benchmarking datasets of fine-grained species recognition.</description><identifier>DOI: 10.48550/arxiv.2310.09929</identifier><language>eng</language><subject>Computer Science - Computation and Language ; Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2023-10</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2310.09929$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2310.09929$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Parashar, Shubham</creatorcontrib><creatorcontrib>Lin, Zhiqiu</creatorcontrib><creatorcontrib>Li, Yanan</creatorcontrib><creatorcontrib>Kong, Shu</creatorcontrib><title>Prompting Scientific Names for Zero-Shot Species Recognition</title><description>Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, for which their scientific names are written in Latin or Greek. 
Indeed, CLIP performs poorly for zero-shot species recognition with prompts that use scientific names, e.g., "a photo of Lepus Timidus" (which is a scientific name in Latin). Because these names are usually not included in CLIP's training set. To improve performance, prior works propose to use large-language models (LLMs) to generate descriptions (e.g., of species color and shape) and additionally use them in prompts. We find that they bring only marginal gains. Differently, we are motivated to translate scientific names (e.g., Lepus Timidus) to common English names (e.g., mountain hare) and use such in the prompts. We find that common names are more likely to be included in CLIP's training set, and prompting them achieves 2$\sim$5 times higher accuracy on benchmarking datasets of fine-grained species recognition.</description><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj81KxDAUhbNxIaMP4Mq8QMekaZK54EYG_2BQsbNyU27u3IwB25S0iL694-jqwOHwcT4hLrRaNitr1RWWr_S5rM2hUAA1nIrrl5L7cU7DXraUeJhTTCSfsOdJxlzkG5dcte95lu3Ih8EkX5nyfkhzysOZOIn4MfH5fy7E9u52u36oNs_3j-ubTYXOQ-WNXjlChKhAoSNu0DeeNdQRglXGBwJjiXcOIaCOgXfU1N6i0eQ0BrMQl3_Y4_1uLKnH8t39anRHDfMD8cRC-w</recordid><startdate>20231015</startdate><enddate>20231015</enddate><creator>Parashar, Shubham</creator><creator>Lin, Zhiqiu</creator><creator>Li, Yanan</creator><creator>Kong, Shu</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20231015</creationdate><title>Prompting Scientific Names for Zero-Shot Species Recognition</title><author>Parashar, Shubham ; Lin, Zhiqiu ; Li, Yanan ; Kong, 
Shu</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a679-73186caa9f090a6ce4a747e192f9b5037bc935ced6a9ba1fbedc4275a31c61ab3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Parashar, Shubham</creatorcontrib><creatorcontrib>Lin, Zhiqiu</creatorcontrib><creatorcontrib>Li, Yanan</creatorcontrib><creatorcontrib>Kong, Shu</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Parashar, Shubham</au><au>Lin, Zhiqiu</au><au>Li, Yanan</au><au>Kong, Shu</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Prompting Scientific Names for Zero-Shot Species Recognition</atitle><date>2023-10-15</date><risdate>2023</risdate><abstract>Trained on web-scale image-text pairs, Vision-Language Models (VLMs) such as CLIP can recognize images of common objects in a zero-shot fashion. However, it is underexplored how to use CLIP for zero-shot recognition of highly specialized concepts, e.g., species of birds, plants, and animals, for which their scientific names are written in Latin or Greek. Indeed, CLIP performs poorly for zero-shot species recognition with prompts that use scientific names, e.g., "a photo of Lepus Timidus" (which is a scientific name in Latin). Because these names are usually not included in CLIP's training set. To improve performance, prior works propose to use large-language models (LLMs) to generate descriptions (e.g., of species color and shape) and additionally use them in prompts. We find that they bring only marginal gains. 
Differently, we are motivated to translate scientific names (e.g., Lepus Timidus) to common English names (e.g., mountain hare) and use such in the prompts. We find that common names are more likely to be included in CLIP's training set, and prompting them achieves 2$\sim$5 times higher accuracy on benchmarking datasets of fine-grained species recognition.</abstract><doi>10.48550/arxiv.2310.09929</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2310.09929
language eng
recordid cdi_arxiv_primary_2310_09929
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
title Prompting Scientific Names for Zero-Shot Species Recognition
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-20T20%3A48%3A39IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Prompting%20Scientific%20Names%20for%20Zero-Shot%20Species%20Recognition&rft.au=Parashar,%20Shubham&rft.date=2023-10-15&rft_id=info:doi/10.48550/arxiv.2310.09929&rft_dat=%3Carxiv_GOX%3E2310_09929%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true