Recognize Anything: A Strong Image Tagging Model

We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations. The development of RAM comprises four key steps. Firstly, annotation-free image tags are obtained at scale through automatic text semantic parsing. Subsequently, a preliminary model is trained for automatic annotation by unifying the caption and tagging tasks, supervised by the original texts and parsed tags, respectively. Thirdly, a data engine is employed to generate additional annotations and clean incorrect ones. Lastly, the model is retrained with the processed data and fine-tuned using a smaller but higher-quality dataset. We evaluate the tagging capabilities of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses fully supervised models and exhibits performance competitive with the Google tagging API. We are releasing RAM at https://recognize-anything.github.io/ to foster the advancement of large models in computer vision.
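The first step described above, obtaining annotation-free image tags through automatic text semantic parsing, can be illustrated with a minimal sketch. This is not the paper's actual parser: here a fixed tag vocabulary and simple word matching stand in for full semantic parsing, and the names `TAG_VOCABULARY` and `parse_tags` are illustrative, not from the released code.

```python
# Minimal sketch of annotation-free tag extraction from captions.
# Assumption: a small fixed vocabulary and word matching approximate
# the full text semantic parsing used by RAM.

TAG_VOCABULARY = {"dog", "frisbee", "grass", "park", "person", "ball"}

def parse_tags(caption: str, vocabulary=TAG_VOCABULARY):
    """Return the sorted list of vocabulary tags mentioned in a caption."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    # crude singularization so "dogs" still matches the tag "dog"
    words |= {w[:-1] for w in words if w.endswith("s")}
    return sorted(vocabulary & words)

if __name__ == "__main__":
    for caption in [
        "A dog catches a frisbee on the grass.",
        "Two dogs play with a ball in the park.",
    ]:
        print(caption, "->", parse_tags(caption))
```

Tags parsed this way from large-scale image-text pairs serve as free supervision for the tagging head, while the original captions supervise the captioning head.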

Full Description

Saved in:
Bibliographic Details
Main Authors: Zhang, Youcai; Huang, Xinyu; Ma, Jinyu; Li, Zhaoyang; Luo, Zhaochuan; Xie, Yanchun; Qin, Yuzhuo; Luo, Tong; Li, Yaqian; Liu, Shilong; Guo, Yandong; Zhang, Lei
Format: Article
Language: eng
Online Access: Order full text
DOI: 10.48550/arxiv.2306.03514
Date: 2023-06-06
Rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition