Boosting Segment Anything Model Towards Open-Vocabulary Learning

The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we enhance it with the capacity to detect arbitrary objects based on human inputs like category names or reference expressions. To accomplish this, we introduce a novel SideFormer module that extracts SAM features to facilitate zero-shot object localization and inject comprehensive semantic information for open-vocabulary recognition. In addition, we devise an open-set region proposal network (Open-set RPN), enabling the detector to acquire the open-set proposals generated by SAM. Sambor demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous SoTA methods. We aspire for this work to serve as a meaningful endeavor in endowing SAM to recognize diverse object categories and advancing open-vocabulary learning with the support of vision foundation models.
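To make the pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch-style sketch of the stated composition: a frozen SAM-style image encoder, a SideFormer-style side network that injects semantics into SAM features, and an open-set RPN that passes SAM's class-agnostic proposals through alongside its own objectness scores. Only the component names come from the abstract; every interface, shape, and hyperparameter below is an illustrative assumption, not the paper's implementation.

import torch
import torch.nn as nn

class SideFormer(nn.Module):
    # Assumed side network: taps frozen encoder features and enriches
    # them with semantic context for open-vocabulary recognition.
    def __init__(self, dim=256):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, feats):              # feats: (B, N, dim) patch tokens
        return self.block(feats)

class OpenSetRPN(nn.Module):
    # Assumed open-set RPN: scores its own locations but also forwards
    # the class-agnostic proposals generated by SAM unchanged.
    def __init__(self, dim=256):
        super().__init__()
        self.objectness = nn.Linear(dim, 1)

    def forward(self, feats, sam_boxes):   # sam_boxes: (B, M, 4) from SAM
        scores = self.objectness(feats).sigmoid()
        return scores, sam_boxes

# Toy forward pass with a frozen stand-in for SAM's image encoder.
encoder = nn.Linear(768, 256)              # stand-in; real SAM uses a ViT
for p in encoder.parameters():
    p.requires_grad_(False)

tokens = torch.randn(1, 196, 768)          # assumed 14x14 patch embeddings
feats = SideFormer()(encoder(tokens))
scores, proposals = OpenSetRPN()(feats, torch.rand(1, 32, 4))
print(scores.shape, proposals.shape)       # (1, 196, 1) and (1, 32, 4)

Running the toy forward pass prints the objectness map and proposal shapes, illustrating how SAM-generated open-set proposals would flow into the detector while the frozen backbone stays intact.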

Detailed Description

Bibliographic Details
Main Authors: Han, Xumeng, Wei, Longhui, Yu, Xuehui, Dou, Zhiyang, He, Xin, Wang, Kuiran, Han, Zhenjun, Tian, Qi
Format: Article
Language: eng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: Order full text
creator Han, Xumeng ; Wei, Longhui ; Yu, Xuehui ; Dou, Zhiyang ; He, Xin ; Wang, Kuiran ; Han, Zhenjun ; Tian, Qi
description The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we enhance it with the capacity to detect arbitrary objects based on human inputs like category names or reference expressions. To accomplish this, we introduce a novel SideFormer module that extracts SAM features to facilitate zero-shot object localization and inject comprehensive semantic information for open-vocabulary recognition. In addition, we devise an open-set region proposal network (Open-set RPN), enabling the detector to acquire the open-set proposals generated by SAM. Sambor demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous SoTA methods. We aspire for this work to serve as a meaningful endeavor in endowing SAM to recognize diverse object categories and advancing open-vocabulary learning with the support of vision foundation models.
doi_str_mv 10.48550/arxiv.2312.03628
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2312.03628
language eng
recordid cdi_arxiv_primary_2312_03628
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Boosting Segment Anything Model Towards Open-Vocabulary Learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T08%3A22%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Boosting%20Segment%20Anything%20Model%20Towards%20Open-Vocabulary%20Learning&rft.au=Han,%20Xumeng&rft.date=2023-12-06&rft_id=info:doi/10.48550/arxiv.2312.03628&rft_dat=%3Carxiv_GOX%3E2312_03628%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true