DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Text-to-image (T2I) generation models have significantly advanced in recent years. However, effective interaction with these models is challenging for average users due to the need for specialized prompt engineering knowledge and the inability to perform multi-turn image generation, hindering a dynamic and iterative creation process. Recent attempts have tried to equip Multi-modal Large Language Models (MLLMs) with T2I models to bring the user's natural language instructions into reality. Hence, the output modality of MLLMs is extended, and the multi-turn generation quality of T2I models is enhanced thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works face challenges in identifying correct output modalities and generating coherent images accordingly as the number of output modalities increases and the conversations go deeper. Therefore, we propose DialogGen, an effective pipeline to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn Text-to-Image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics to measure the model's ability to switch modalities and the coherence of the output images. Our extensive experiments on DialogBen and a user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.
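The abstract describes a dialogue system in which an MLLM first decides the output modality for each user turn and, for image turns, rewrites the multi-turn history into a single aligned drawing prompt for an off-the-shelf T2I model. A minimal sketch of that control loop, assuming invented names throughout (the `MIDS` class, its methods, and the keyword-based modality stub are all hypothetical, not the authors' implementation):

```python
from dataclasses import dataclass, field

@dataclass
class MIDS:
    """Hypothetical multi-turn loop: modality decision, then prompt alignment."""
    history: list = field(default_factory=list)

    def decide_modality(self, turn: str) -> str:
        # Stub for the MLLM's modality decision ("image" vs "text").
        # A real system would query the MLLM; keywords stand in here.
        cues = ("draw", "generate", "edit", "paint")
        return "image" if any(c in turn.lower() for c in cues) else "text"

    def align_drawing_prompt(self) -> str:
        # Stub for drawing-prompt alignment: condense the dialogue
        # history into one self-contained prompt for the T2I model.
        return "; ".join(self.history)

    def respond(self, turn: str) -> tuple[str, str]:
        self.history.append(turn)
        if self.decide_modality(turn) == "image":
            return ("image", self.align_drawing_prompt())
        return ("text", f"[chat reply to: {turn}]")

mids = MIDS()
print(mids.respond("What is a corgi?")[0])           # text turn
print(mids.respond("Draw a corgi on the beach")[0])  # image turn
```

The point of the sketch is the routing structure: because the aligned prompt carries the whole history, each later image request stays coherent with earlier turns, which is the multi-turn property DialogBen's two metrics (modality switching and output coherence) are said to evaluate.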

Bibliographic Details
Main authors: Huang, Minbin; Long, Yanxin; Deng, Xinchi; Chu, Ruihang; Xiong, Jiangfeng; Liang, Xiaodan; Cheng, Hong; Lu, Qinglin; Liu, Wei
Format: Article
Language: English
Online access: Order full text
DOI: 10.48550/arxiv.2403.08857
Date: 2024-03-13
Rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition