MOFI: Learning Image Representations from Noisy Entity Annotated Images

Bibliographic Details
Main authors: Wu, Wentao; Timofeev, Aleksei; Chen, Chen; Zhang, Bowen; Duan, Kun; Liu, Shuangning; Zheng, Yantao; Shlens, Jonathon; Du, Xianzhi; Gan, Zhe; Yang, Yinfei
Format: Article
Language: English
description We present MOFI, Manifold OF Images, a new vision foundation model designed to learn image representations from noisy entity annotated images. MOFI differs from previous work in two key aspects: (i) pre-training data, and (ii) training recipe. Regarding data, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs. Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image. It's a simple, cost-effective method that can scale to handle billions of web-mined image-text pairs. Through this method, we have created Image-to-Entities (I2E), a new dataset with 1 billion images and 2 million distinct entities, covering rich visual concepts in the wild. Building upon the I2E dataset, we study different training recipes like supervised pre-training, contrastive pre-training, and multi-task learning. For contrastive pre-training, we treat entity names as free-form text, and further enrich them with entity descriptions. Experiments show that supervised pre-training with large-scale fine-grained entity labels is highly effective for image retrieval tasks, and multi-task training further improves the performance. The final MOFI model achieves 86.66% mAP on the challenging GPR1200 dataset, surpassing the previous state-of-the-art performance of 72.19% from OpenAI's CLIP model. Further experiments on zero-shot and linear probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations. We release our code and model weights at https://github.com/apple/ml-mofi.
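The labeling pipeline described in the abstract (a NER model proposes candidate entities from the alt-text, then a CLIP model keeps only the entities that actually match the image) can be sketched as follows. The embedding functions, example entities, and similarity threshold below are illustrative stand-ins, not the paper's actual models or values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_entity_labels(image_emb, candidates, embed_text, threshold=0.5):
    """Keep NER-proposed entities whose text embedding aligns with the image.

    In the paper this filtering uses CLIP image/text encoders; here the
    encoders are passed in as plain functions so the logic stays self-contained.
    """
    return [e for e in candidates if cosine(image_emb, embed_text(e)) >= threshold]

# Toy embeddings standing in for CLIP text encodings of NER candidates.
toy_text_emb = {
    "Eiffel Tower": [1.0, 0.0],
    "Paris": [0.8, 0.6],
    "recipe": [0.0, 1.0],
}
image_emb = [1.0, 0.1]  # pretend CLIP image embedding

labels = select_entity_labels(image_emb, list(toy_text_emb), toy_text_emb.get)
print(labels)
```

Run over billions of web-mined pairs, this kind of filter is what turns noisy alt-text into the entity labels of the I2E dataset.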
doi 10.48550/arxiv.2306.07952
creationdate 2023-06-13
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Learning