EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component driving its impressive zero-shot transfer performance and high versatility is a very large Transformer model trained on the extensive, high-quality SA-1B dataset. While beneficial, the huge computational cost of the SAM model has limited its use in wider real-world applications. To address this limitation, we propose EfficientSAMs, lightweight SAM models that exhibit decent performance at largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from the SAM image encoder for effective visual representation learning. We then take SAMI-pretrained lightweight image encoders and the mask decoder to build EfficientSAMs, and finetune the models on SA-1B for the segment anything task. We perform evaluations on multiple vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything tasks such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably, with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.

Detailed description

Saved in:
Bibliographic details
Main authors: Xiong, Yunyang; Varadarajan, Bala; Wu, Lemeng; Xiang, Xiaoyu; Xiao, Fanyi; Zhu, Chenchen; Dai, Xiaoliang; Wang, Dilin; Sun, Fei; Iandola, Forrest; Krishnamoorthi, Raghuraman; Chandra, Vikas
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
creator Xiong, Yunyang; Varadarajan, Bala; Wu, Lemeng; Xiang, Xiaoyu; Xiao, Fanyi; Zhu, Chenchen; Dai, Xiaoliang; Wang, Dilin; Sun, Fei; Iandola, Forrest; Krishnamoorthi, Raghuraman; Chandra, Vikas
description Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component driving its impressive zero-shot transfer performance and high versatility is a very large Transformer model trained on the extensive, high-quality SA-1B dataset. While beneficial, the huge computational cost of the SAM model has limited its use in wider real-world applications. To address this limitation, we propose EfficientSAMs, lightweight SAM models that exhibit decent performance at largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from the SAM image encoder for effective visual representation learning. We then take SAMI-pretrained lightweight image encoders and the mask decoder to build EfficientSAMs, and finetune the models on SA-1B for the segment anything task. We perform evaluations on multiple vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything tasks such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably, with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.
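The core idea in the description is that SAMI pretrains a lightweight image encoder to reconstruct the features produced by the frozen SAM image encoder, rather than raw pixels as in standard masked-image pretraining. The following PyTorch sketch illustrates that feature-reconstruction objective in miniature; the TinyViT stand-in modules, their sizes, and the uniform random token masking are illustrative assumptions, not the paper's actual architecture or training recipe (the real teacher would be SAM's ViT-H encoder, loaded frozen).

import torch
import torch.nn as nn

# Illustrative ViT-like encoder used for both teacher and student.
class TinyViT(nn.Module):
    def __init__(self, dim, depth, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.blocks(tokens)

teacher = TinyViT(dim=256, depth=4).eval()   # stand-in for the frozen SAM image encoder
for p in teacher.parameters():
    p.requires_grad = False

student = TinyViT(dim=128, depth=2)          # the lightweight encoder being pretrained
project = nn.Linear(128, 256)                # align student width to teacher width

def sami_loss(images, mask_ratio=0.75):
    # Targets are the frozen teacher's features; no gradients flow to the teacher.
    with torch.no_grad():
        target = teacher(images)                       # (B, N, 256)
    # Simplification: the student sees the full image here; the paper uses a
    # MAE-style setup where masked tokens are dropped from the encoder input
    # and reconstructed by a small decoder.
    feats = project(student(images))                   # (B, N, 256)
    mask = torch.rand(target.shape[:2], device=images.device) < mask_ratio
    # Reconstruction loss is computed only at the randomly masked token positions.
    return ((feats - target) ** 2)[mask].mean()

loss = sami_loss(torch.randn(2, 3, 224, 224))
loss.backward()

In the paper's second stage, the SAMI-pretrained lightweight encoder replaces SAM's heavy ViT-H encoder and the resulting model, together with the mask decoder, is finetuned on SA-1B for the promptable segment anything task; the ~4 AP COCO/LVIS gain quoted above is reported for that finetuned model, not for the pretraining sketch shown here.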
doi_str_mv 10.48550/arxiv.2312.00863
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2312.00863
language eng
recordid cdi_arxiv_primary_2312_00863
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
url https://arxiv.org/abs/2312.00863