Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models

Diffusion models have revitalized the image generation domain, playing crucial roles in both academic research and artistic expression. With the emergence of new diffusion models, assessing the performance of text-to-image (T2I) models has become increasingly important. Current metrics focus on directly matching the input text with the generated image, but due to cross-modal information asymmetry this leads to unreliable or incomplete assessments. Motivated by this, we introduce the Image Regeneration task, which assesses a T2I model by tasking it with generating an image according to a reference image. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model, allowing T2I models to understand image content. This simplifies the evaluation process, since comparing the generated image with the reference image is straightforward. Two regeneration datasets, spanning content-diverse and style-diverse evaluation settings, are introduced to evaluate the leading diffusion models currently available. Additionally, we present the ImageRepainter framework, which enhances the quality of generated images by improving content comprehension via MLLM-guided iterative generation and revision. Our comprehensive experiments showcase the effectiveness of this framework in assessing the generative capabilities of models. By leveraging MLLMs, we demonstrate that a robust T2I model can produce images more closely resembling the reference image.
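The evaluation loop the abstract describes (an MLLM captions the reference image, a T2I model regenerates an image from that caption, and generation and reference are compared, with iterative MLLM-guided revision in ImageRepainter) can be sketched roughly as follows. This is a toy sketch only: `describe_image`, `generate_image`, `similarity`, and the prompt-revision step are hypothetical stand-ins, not the paper's actual GPT4V, diffusion, or scoring calls.

```python
# Toy sketch of the Image Regeneration evaluation loop with
# ImageRepainter-style iterative revision. All functions are
# hypothetical stubs, not the paper's implementation.

def describe_image(image):
    """MLLM stand-in: turn a reference image into a text prompt."""
    return f"a picture containing {sorted(set(image))}"

def generate_image(prompt, seed=0):
    """T2I stand-in: deterministically derive 'pixels' from the prompt."""
    return [(ord(c) + seed) % 256 for c in prompt][:16]

def similarity(a, b):
    """Toy pixel similarity in [0, 1]; real work would use a learned metric."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    diff = sum(abs(x - y) for x, y in zip(a, b)) / (255 * n)
    return 1.0 - diff

def imagerepainter(reference, rounds=3):
    """Generate, compare with the reference, and let the 'MLLM' revise the prompt."""
    prompt = describe_image(reference)
    best_img, best_score = [], -1.0
    for seed in range(rounds):
        img = generate_image(prompt, seed=seed)
        score = similarity(reference, img)
        if score > best_score:
            best_img, best_score = img, score
        prompt = prompt + " (revised)"  # stand-in for MLLM feedback on the mismatch
    return best_img, best_score

best_img, best_score = imagerepainter([10, 20, 30, 40])
```

The final similarity score plays the role of the evaluation metric: a stronger T2I model should regenerate images closer to the reference.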

Full description

Saved in:
Bibliographic details
Main authors: Meng, Chutian; Ma, Fan; Miao, Jiaxu; Zhang, Chi; Yang, Yi; Zhuang, Yueting
Format: Article
Language: English
Subjects:
Online access: Request full text
DOI: 10.48550/arxiv.2411.09449
Published: 2024-11-14
Rights: http://creativecommons.org/licenses/by/4.0
Full text: https://arxiv.org/abs/2411.09449
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition