MOFI: Learning Image Representations from Noisy Entity Annotated Images

Bibliographic Details
Main authors: Wu, Wentao; Timofeev, Aleksei; Chen, Chen; Zhang, Bowen; Duan, Kun; Liu, Shuangning; Zheng, Yantao; Shlens, Jonathon; Du, Xianzhi; Gan, Zhe; Yang, Yinfei
Format: Article
Language: English
description We present MOFI, Manifold OF Images, a new vision foundation model designed to learn image representations from noisy entity annotated images. MOFI differs from previous work in two key aspects: (i) pre-training data, and (ii) training recipe. Regarding data, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs. Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image. It's a simple, cost-effective method that can scale to handle billions of web-mined image-text pairs. Through this method, we have created Image-to-Entities (I2E), a new dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild. Building upon the I2E dataset, we study different training recipes like supervised pre-training, contrastive pre-training, and multi-task learning. For contrastive pre-training, we treat entity names as free-form text, and further enrich them with entity descriptions. Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and multi-task training further improves the performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art performance of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations. We release our code and model weights at https://github.com/apple/ml-mofi.
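The labeling pipeline described in the abstract (a NER model proposes candidate entities from the alt-text, then a CLIP model keeps only the entities that actually match the image) can be sketched as follows. The embedding functions, example entities, and similarity threshold below are illustrative stand-ins, not the paper's actual models or values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_entity_labels(image_emb, candidates, embed_text, threshold=0.5):
    """Keep NER-proposed entities whose text embedding aligns with the image.

    In the paper this filtering uses CLIP image/text encoders; here the
    encoders are passed in as plain functions so the logic stays self-contained.
    """
    return [e for e in candidates if cosine(image_emb, embed_text(e)) >= threshold]

# Toy embeddings standing in for CLIP text encodings of NER candidates.
toy_text_emb = {
    "Eiffel Tower": [1.0, 0.0],
    "Paris": [0.8, 0.6],
    "recipe": [0.0, 1.0],
}
image_emb = [1.0, 0.1]  # pretend CLIP image embedding

labels = select_entity_labels(image_emb, list(toy_text_emb), toy_text_emb.get)
print(labels)
```

Run over billions of web-mined pairs, this kind of filter is what turns noisy alt-text into the entity labels of the I2E dataset.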
doi 10.48550/arxiv.2306.07952
creationdate 2023-06-13
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Learning