Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models
creator | Meng, Chutian; Ma, Fan; Miao, Jiaxu; Zhang, Chi; Yang, Yi; Zhuang, Yueting |
description | Diffusion models have revitalized the image generation domain, playing crucial roles in both academic research and artistic expression. With the emergence of new diffusion models, assessing the performance of text-to-image models has become increasingly important. Current metrics focus on directly matching the input text with the generated image, but due to cross-modal information asymmetry this leads to unreliable or incomplete assessment results. Motivated by this, we introduce the Image Regeneration task, which assesses text-to-image models by tasking the T2I model with generating an image according to a reference image. We use GPT-4V to bridge the gap between the reference image and the text input for the T2I model, allowing T2I models to understand image content. This simplifies the evaluation process, since comparing the generated image with the reference image is straightforward. Two regeneration datasets, one content-diverse and one style-diverse, are introduced to evaluate the leading diffusion models currently available. Additionally, we present the ImageRepainter framework, which enhances the quality of generated images by improving content comprehension via MLLM-guided iterative generation and revision. Our comprehensive experiments showcase the effectiveness of this framework in assessing the generative capabilities of models. By leveraging MLLMs, we demonstrate that a robust T2I model can produce images more closely resembling the reference image. |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2411_09449</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2411_09449</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2411_094493</originalsourceid><addsrcrecordid>eNqFjr0OgjAUhbs4GPUBnOwLgKAlEVeDSgKLYSc3cq1NSmtK-fHtRSCuTufk5MvJR8ja91x2CAJvC6YTjbtjvu96IWPhnLRxCRzpDTkqNGCFVkcaNSDrvitOM-ysY7UzYqkuUNJGAL1MeI_EBSor7iDpCLXCPmlaSytKXfRrAqZfE1C8_n1USzJ7gKxwNeWCbM5Rdro6g2L-MqIE886_qvmguv9PfAAOv0py</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models</title><source>arXiv.org</source><creator>Meng, Chutian ; Ma, Fan ; Miao, Jiaxu ; Zhang, Chi ; Yang, Yi ; Zhuang, Yueting</creator><creatorcontrib>Meng, Chutian ; Ma, Fan ; Miao, Jiaxu ; Zhang, Chi ; Yang, Yi ; Zhuang, Yueting</creatorcontrib><description>Diffusion models have revitalized the image generation domain, playing
crucial roles in both academic research and artistic expression. With the
emergence of new diffusion models, assessing the performance of text-to-image
models has become increasingly important. Current metrics focus on directly
matching the input text with the generated image, but due to cross-modal
information asymmetry, this leads to unreliable or incomplete assessment
results. Motivated by this, we introduce the Image Regeneration task in this
study to assess text-to-image models by tasking the T2I model with generating
an image according to the reference image. We use GPT4V to bridge the gap
between the reference image and the text input for the T2I model, allowing T2I
models to understand image content. This evaluation process is simplified as
comparisons between the generated image and the reference image are
straightforward. Two regeneration datasets spanning content-diverse and
style-diverse evaluation dataset are introduced to evaluate the leading
diffusion models currently available. Additionally, we present ImageRepainter
framework to enhance the quality of generated images by improving content
comprehension via MLLM guided iterative generation and revision. Our
comprehensive experiments have showcased the effectiveness of this framework in
assessing the generative capabilities of models. By leveraging MLLM, we have
demonstrated that a robust T2M can produce images more closely resembling the
reference image.</description><identifier>DOI: 10.48550/arxiv.2411.09449</identifier><language>eng</language><subject>Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2024-11</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2411.09449$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2411.09449$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Meng, Chutian</creatorcontrib><creatorcontrib>Ma, Fan</creatorcontrib><creatorcontrib>Miao, Jiaxu</creatorcontrib><creatorcontrib>Zhang, Chi</creatorcontrib><creatorcontrib>Yang, Yi</creatorcontrib><creatorcontrib>Zhuang, Yueting</creatorcontrib><title>Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models</title><description>Diffusion models have revitalized the image generation domain, playing
crucial roles in both academic research and artistic expression. With the
emergence of new diffusion models, assessing the performance of text-to-image
models has become increasingly important. Current metrics focus on directly
matching the input text with the generated image, but due to cross-modal
information asymmetry, this leads to unreliable or incomplete assessment
results. Motivated by this, we introduce the Image Regeneration task in this
study to assess text-to-image models by tasking the T2I model with generating
an image according to the reference image. We use GPT4V to bridge the gap
between the reference image and the text input for the T2I model, allowing T2I
models to understand image content. This evaluation process is simplified as
comparisons between the generated image and the reference image are
straightforward. Two regeneration datasets spanning content-diverse and
style-diverse evaluation dataset are introduced to evaluate the leading
diffusion models currently available. Additionally, we present ImageRepainter
framework to enhance the quality of generated images by improving content
comprehension via MLLM guided iterative generation and revision. Our
comprehensive experiments have showcased the effectiveness of this framework in
assessing the generative capabilities of models. By leveraging MLLM, we have
demonstrated that a robust T2M can produce images more closely resembling the
reference image.</description><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNqFjr0OgjAUhbs4GPUBnOwLgKAlEVeDSgKLYSc3cq1NSmtK-fHtRSCuTufk5MvJR8ja91x2CAJvC6YTjbtjvu96IWPhnLRxCRzpDTkqNGCFVkcaNSDrvitOM-ysY7UzYqkuUNJGAL1MeI_EBSor7iDpCLXCPmlaSytKXfRrAqZfE1C8_n1USzJ7gKxwNeWCbM5Rdro6g2L-MqIE886_qvmguv9PfAAOv0py</recordid><startdate>20241114</startdate><enddate>20241114</enddate><creator>Meng, Chutian</creator><creator>Ma, Fan</creator><creator>Miao, Jiaxu</creator><creator>Zhang, Chi</creator><creator>Yang, Yi</creator><creator>Zhuang, Yueting</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20241114</creationdate><title>Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models</title><author>Meng, Chutian ; Ma, Fan ; Miao, Jiaxu ; Zhang, Chi ; Yang, Yi ; Zhuang, Yueting</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2411_094493</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Meng, Chutian</creatorcontrib><creatorcontrib>Ma, Fan</creatorcontrib><creatorcontrib>Miao, Jiaxu</creatorcontrib><creatorcontrib>Zhang, Chi</creatorcontrib><creatorcontrib>Yang, Yi</creatorcontrib><creatorcontrib>Zhuang, Yueting</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Meng, Chutian</au><au>Ma, Fan</au><au>Miao, Jiaxu</au><au>Zhang, Chi</au><au>Yang, Yi</au><au>Zhuang, Yueting</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models</atitle><date>2024-11-14</date><risdate>2024</risdate><abstract>Diffusion models have revitalized the image generation domain, playing
crucial roles in both academic research and artistic expression. With the
emergence of new diffusion models, assessing the performance of text-to-image
models has become increasingly important. Current metrics focus on directly
matching the input text with the generated image, but due to cross-modal
information asymmetry, this leads to unreliable or incomplete assessment
results. Motivated by this, we introduce the Image Regeneration task in this
study to assess text-to-image models by tasking the T2I model with generating
an image according to the reference image. We use GPT4V to bridge the gap
between the reference image and the text input for the T2I model, allowing T2I
models to understand image content. This evaluation process is simplified as
comparisons between the generated image and the reference image are
straightforward. Two regeneration datasets spanning content-diverse and
style-diverse evaluation dataset are introduced to evaluate the leading
diffusion models currently available. Additionally, we present ImageRepainter
framework to enhance the quality of generated images by improving content
comprehension via MLLM guided iterative generation and revision. Our
comprehensive experiments have showcased the effectiveness of this framework in
assessing the generative capabilities of models. By leveraging MLLM, we have
demonstrated that a robust T2M can produce images more closely resembling the
reference image.</abstract><doi>10.48550/arxiv.2411.09449</doi><oa>free_for_read</oa></addata></record> |
identifier | DOI: 10.48550/arxiv.2411.09449 |
language | eng |
recordid | cdi_arxiv_primary_2411_09449 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models |
url | https://arxiv.org/abs/2411.09449 |