Boosting Segment Anything Model Towards Open-Vocabulary Learning
Main authors: | Han, Xumeng; Wei, Longhui; Yu, Xuehui; Dou, Zhiyang; He, Xin; Wang, Kuiran; Han, Zhenjun; Tian, Qi |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computer Vision and Pattern Recognition |
Online access: | Order full text |
creator | Han, Xumeng; Wei, Longhui; Yu, Xuehui; Dou, Zhiyang; He, Xin; Wang, Kuiran; Han, Zhenjun; Tian, Qi |
description | The recent Segment Anything Model (SAM) has emerged as a new paradigmatic
vision foundation model, showcasing potent zero-shot generalization and
flexible prompting. Despite SAM finding applications and adaptations in various
domains, its primary limitation lies in the inability to grasp object
semantics. In this paper, we present Sambor to seamlessly integrate SAM with
the open-vocabulary object detector in an end-to-end framework. While retaining
all the remarkable capabilities inherent to SAM, we enhance it with the
capacity to detect arbitrary objects based on human inputs like category names
or reference expressions. To accomplish this, we introduce a novel SideFormer
module that extracts SAM features to facilitate zero-shot object localization
and inject comprehensive semantic information for open-vocabulary recognition.
In addition, we devise an open-set region proposal network (Open-set RPN),
enabling the detector to acquire the open-set proposals generated by SAM.
Sambor demonstrates superior zero-shot performance across benchmarks, including
COCO and LVIS, proving highly competitive against previous SoTA methods. We
aspire for this work to serve as a meaningful step toward endowing SAM with
the ability to recognize diverse object categories and toward advancing
open-vocabulary learning with the support of vision foundation models. |
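The abstract names three architectural pieces: SAM's pretrained encoder kept intact, a SideFormer module that adapts its features for detection, and an Open-set RPN that folds SAM-generated proposals into the detector. As a reading aid, the sketch below shows one plausible way those pieces could compose in PyTorch. It is a minimal illustration only; the class names, method signatures, and the `extra_proposals` argument are hypothetical stand-ins, not the authors' published interface.

```python
import torch
import torch.nn as nn


class Sambor(nn.Module):
    """Hypothetical wiring of the components the abstract describes:
    a frozen SAM backbone, a SideFormer adapter, and an open-set RPN
    that also consumes proposals generated by SAM itself."""

    def __init__(self, sam_encoder, sideformer, open_set_rpn, detection_head):
        super().__init__()
        self.sam_encoder = sam_encoder        # pretrained SAM image encoder, kept frozen
        self.sideformer = sideformer          # taps SAM features, injects semantics
        self.open_set_rpn = open_set_rpn      # class-agnostic open-set proposals
        self.detection_head = detection_head  # scores regions against text embeddings

    def forward(self, images, text_embeddings, sam_proposals):
        # 1. Reuse SAM's class-agnostic visual features without touching its
        #    weights, so all of SAM's original capabilities are retained.
        with torch.no_grad():
            sam_feats = self.sam_encoder(images)

        # 2. SideFormer adapts the frozen features for zero-shot localization
        #    and open-vocabulary recognition.
        det_feats = self.sideformer(sam_feats)

        # 3. The open-set RPN produces proposals and, per the abstract, can
        #    also acquire the open-set proposals generated by SAM.
        proposals = self.open_set_rpn(det_feats, extra_proposals=sam_proposals)

        # 4. Each region is matched to text embeddings of category names or
        #    referring expressions (e.g., from a CLIP-style text encoder).
        return self.detection_head(det_feats, proposals, text_embeddings)
```

The design point the abstract emphasizes, keeping SAM untouched while adding recognition on the side, corresponds here to the no-grad encoder pass: only the side modules would carry trainable detection-specific parameters.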
doi_str_mv | 10.48550/arxiv.2312.03628 |
format | Article |
creationdate | 2023-12-06 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
oa | free_for_read |
identifier | DOI: 10.48550/arxiv.2312.03628 |
language | eng |
recordid | cdi_arxiv_primary_2312_03628 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | Boosting Segment Anything Model Towards Open-Vocabulary Learning |
url | https://arxiv.org/abs/2312.03628 |