MOFI: Learning Image Representations from Noisy Entity Annotated Images
Saved in:
Main authors: Wu, Wentao; Timofeev, Aleksei; Chen, Chen; Zhang, Bowen; Duan, Kun; Liu, Shuangning; Zheng, Yantao; Shlens, Jonathon; Du, Xianzhi; Gan, Zhe; Yang, Yinfei
Format: Article
Language: English
Online access: order full text
creator | Wu, Wentao; Timofeev, Aleksei; Chen, Chen; Zhang, Bowen; Duan, Kun; Liu, Shuangning; Zheng, Yantao; Shlens, Jonathon; Du, Xianzhi; Gan, Zhe; Yang, Yinfei |
description | We present MOFI, Manifold OF Images, a new vision foundation model designed
to learn image representations from noisy entity annotated images. MOFI differs
from previous work in two key aspects: (i) pre-training data, and (ii) training
recipe. Regarding data, we introduce a new approach to automatically assign
entity labels to images from noisy image-text pairs. Our approach involves
employing a named entity recognition model to extract entities from the
alt-text, and then using a CLIP model to select the correct entities as labels
of the paired image. It's a simple, cost-effective method that can scale to
handle billions of web-mined image-text pairs. Through this method, we have
created Image-to-Entities (I2E), a new dataset with 1 billion images and 2
million distinct entities, covering rich visual concepts in the wild. Building
upon the I2E dataset, we study different training recipes like supervised
pre-training, contrastive pre-training, and multi-task learning. For
contrastive pre-training, we treat entity names as free-form text, and further
enrich them with entity descriptions. Experiments show that supervised
pre-training with large-scale fine-grained entity labels is highly effective
for image retrieval tasks, and multi-task training further improves the
performance. The final MOFI model achieves 86.66% mAP on the challenging
GPR1200 dataset, surpassing the previous state-of-the-art performance of 72.19%
from OpenAI's CLIP model. Further experiments on zero-shot and linear probe
image classification also show that MOFI outperforms a CLIP model trained on
the original image-text data, demonstrating the effectiveness of the I2E
dataset in learning strong image representations. We release our code and model
weights at https://github.com/apple/ml-mofi. |
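The labeling approach the abstract describes (NER on the alt-text, then CLIP-based selection of the entity that best matches the image) can be sketched as follows. This is a toy illustration, not the authors' implementation: `extract_entities` is a placeholder for a real named entity recognition model, and `embed` is a trivial bag-of-characters stand-in for CLIP's image and text encoders; the function and parameter names are hypothetical.

```python
# Toy sketch of the I2E labeling pipeline: extract candidate entities from
# alt-text, then keep the candidate whose embedding is most similar to the
# image embedding, subject to a minimum-similarity threshold.
import math
from collections import Counter


def extract_entities(alt_text, known_entities):
    """Placeholder NER: return known entity names that appear in the alt-text."""
    lower = alt_text.lower()
    return [e for e in known_entities if e.lower() in lower]


def embed(text):
    """Placeholder encoder: L2-normalized bag-of-characters vector.

    A stand-in for CLIP's text/image encoders, which map both modalities
    into a shared embedding space.
    """
    counts = Counter(text.lower())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {ch: v / norm for ch, v in counts.items()}


def cosine(a, b):
    """Cosine similarity of two normalized sparse vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def select_entity(image_caption, alt_text, known_entities, threshold=0.5):
    """Keep the candidate entity most similar to the image representation.

    Here the 'image embedding' is faked by embedding a caption string; in
    the real pipeline it would come from CLIP's image encoder applied to
    the paired image.
    """
    candidates = extract_entities(alt_text, known_entities)
    if not candidates:
        return None
    img_vec = embed(image_caption)
    best = max(candidates, key=lambda e: cosine(embed(e), img_vec))
    return best if cosine(embed(best), img_vec) >= threshold else None
```

The threshold step reflects the noise-filtering intent of the method: an entity mentioned in the alt-text is kept as a label only if the (CLIP-style) image-text similarity is high enough, which is what lets the approach scale over billions of noisy web-mined pairs without human annotation.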
doi_str_mv | 10.48550/arxiv.2306.07952 |
format | Article |
identifier | DOI: 10.48550/arxiv.2306.07952 |
language | eng |
source | arXiv.org |
subjects | Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning |
title | MOFI: Learning Image Representations from Noisy Entity Annotated Images |