Recognize Anything: A Strong Image Tagging Model

We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations. The development of RAM comprises four key steps. Firstly, annotation-free image tags are obtained at scale through automatic text semantic parsing. Subsequently, a preliminary model is trained for automatic annotation by unifying the caption and tagging tasks, supervised by the original texts and parsed tags, respectively. Thirdly, a data engine is employed to generate additional annotations and clean incorrect ones. Lastly, the model is retrained with the processed data and fine-tuned using a smaller but higher-quality dataset. We evaluate the tagging capabilities of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses fully supervised models and exhibits performance competitive with the Google tagging API. We are releasing RAM at https://recognize-anything.github.io/ to foster the advancement of large models in computer vision.
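The first step described above, obtaining annotation-free image tags through automatic text semantic parsing, can be illustrated with a minimal sketch. This is not the paper's actual parser: here a fixed tag vocabulary and simple word matching stand in for full semantic parsing, and the names `TAG_VOCABULARY` and `parse_tags` are illustrative, not from the released code.

```python
# Minimal sketch of annotation-free tag extraction from captions.
# Assumption: a small fixed vocabulary and word matching approximate
# the full text semantic parsing used by RAM.

TAG_VOCABULARY = {"dog", "frisbee", "grass", "park", "person", "ball"}

def parse_tags(caption: str, vocabulary=TAG_VOCABULARY):
    """Return the sorted list of vocabulary tags mentioned in a caption."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    # crude singularization so "dogs" still matches the tag "dog"
    words |= {w[:-1] for w in words if w.endswith("s")}
    return sorted(vocabulary & words)

if __name__ == "__main__":
    for caption in [
        "A dog catches a frisbee on the grass.",
        "Two dogs play with a ball in the park.",
    ]:
        print(caption, "->", parse_tags(caption))
```

Tags parsed this way from large-scale image-text pairs serve as free supervision for the tagging head, while the original captions supervise the captioning head.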

Full Description

Saved in:
Bibliographic Details
Main Authors: Zhang, Youcai; Huang, Xinyu; Ma, Jinyu; Li, Zhaoyang; Luo, Zhaochuan; Xie, Yanchun; Qin, Yuzhuo; Luo, Tong; Li, Yaqian; Liu, Shilong; Guo, Yandong; Zhang, Lei
Format: Article
Language: eng
Online Access: Order full text
DOI: 10.48550/arxiv.2306.03514
Date: 2023-06-06
Rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition