Boosting Segment Anything Model Towards Open-Vocabulary Learning
Main authors: | Han, Xumeng; Wei, Longhui; Yu, Xuehui; Dou, Zhiyang; He, Xin; Wang, Kuiran; Han, Zhenjun; Tian, Qi |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computer Vision and Pattern Recognition |
Online access: | Order full text |
creator | Han, Xumeng; Wei, Longhui; Yu, Xuehui; Dou, Zhiyang; He, Xin; Wang, Kuiran; Han, Zhenjun; Tian, Qi |
description | The recent Segment Anything Model (SAM) has emerged as a new paradigmatic
vision foundation model, showcasing potent zero-shot generalization and
flexible prompting. Despite SAM finding applications and adaptations in various
domains, its primary limitation lies in the inability to grasp object
semantics. In this paper, we present Sambor to seamlessly integrate SAM with
the open-vocabulary object detector in an end-to-end framework. While retaining
all the remarkable capabilities inherent to SAM, we enhance it with the
capacity to detect arbitrary objects based on human inputs like category names
or reference expressions. To accomplish this, we introduce a novel SideFormer
module that extracts SAM features to facilitate zero-shot object localization
and inject comprehensive semantic information for open-vocabulary recognition.
In addition, we devise an open-set region proposal network (Open-set RPN),
enabling the detector to acquire the open-set proposals generated by SAM.
Sambor demonstrates superior zero-shot performance across benchmarks, including
COCO and LVIS, proving highly competitive against previous SoTA methods. We
aspire for this work to serve as a meaningful step toward endowing SAM with
the ability to recognize diverse object categories and toward advancing
open-vocabulary learning with the support of vision foundation models. |
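The abstract names three architectural pieces: SAM's pretrained encoder kept intact, a SideFormer module that adapts its features for detection, and an Open-set RPN that folds SAM-generated proposals into the detector. As a reading aid, the sketch below shows one plausible way those pieces could compose in PyTorch. It is a minimal illustration only; the class names, method signatures, and the `extra_proposals` argument are hypothetical stand-ins, not the authors' published interface.

```python
import torch
import torch.nn as nn


class Sambor(nn.Module):
    """Hypothetical wiring of the components the abstract describes:
    a frozen SAM backbone, a SideFormer adapter, and an open-set RPN
    that also consumes proposals generated by SAM itself."""

    def __init__(self, sam_encoder, sideformer, open_set_rpn, detection_head):
        super().__init__()
        self.sam_encoder = sam_encoder        # pretrained SAM image encoder, kept frozen
        self.sideformer = sideformer          # taps SAM features, injects semantics
        self.open_set_rpn = open_set_rpn      # class-agnostic open-set proposals
        self.detection_head = detection_head  # scores regions against text embeddings

    def forward(self, images, text_embeddings, sam_proposals):
        # 1. Reuse SAM's class-agnostic visual features without touching its
        #    weights, so all of SAM's original capabilities are retained.
        with torch.no_grad():
            sam_feats = self.sam_encoder(images)

        # 2. SideFormer adapts the frozen features for zero-shot localization
        #    and open-vocabulary recognition.
        det_feats = self.sideformer(sam_feats)

        # 3. The open-set RPN produces proposals and, per the abstract, can
        #    also acquire the open-set proposals generated by SAM.
        proposals = self.open_set_rpn(det_feats, extra_proposals=sam_proposals)

        # 4. Each region is matched to text embeddings of category names or
        #    referring expressions (e.g., from a CLIP-style text encoder).
        return self.detection_head(det_feats, proposals, text_embeddings)
```

The design point the abstract emphasizes, keeping SAM untouched while adding recognition on the side, corresponds here to the no-grad encoder pass: only the side modules would carry trainable detection-specific parameters.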
doi_str_mv | 10.48550/arxiv.2312.03628 |
format | Article |
creationdate | 2023-12-06 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
oa | free_for_read |
identifier | DOI: 10.48550/arxiv.2312.03628 |
language | eng |
recordid | cdi_arxiv_primary_2312_03628 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | Boosting Segment Anything Model Towards Open-Vocabulary Learning |
url | https://arxiv.org/abs/2312.03628 |