MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned on any combinations of various modalities that require all of them must be exactly conformed, hindering the synthesis controllability and leaving the potential of cross-modality under-exploited. To this end, w...

Bibliographic details
Published in: International journal of computer vision 2024-09, Vol.132 (9), p.3537-3565
Main authors: Zheng, Jianbin ; Liu, Daqing ; Wang, Chaoyue ; Hu, Minghui ; Yang, Zuopeng ; Ding, Changxing ; Tao, Dacheng
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 3565
container_issue 9
container_start_page 3537
container_title International journal of computer vision
container_volume 132
creator Zheng, Jianbin
Liu, Daqing
Wang, Chaoyue
Hu, Minghui
Yang, Zuopeng
Ding, Changxing
Tao, Dacheng
description Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned on any combinations of various modalities that require all of them must be exactly conformed, hindering the synthesis controllability and leaving the potential of cross-modality under-exploited. To this end, we propose to generate images conditioned on the compositions of multimodal control signals, where modalities are imperfectly complementary, i.e., composed multimodal conditional image synthesis (CMCIS). Specifically, we observe two challenging issues of the proposed CMCIS task, i.e., the modality coordination problem and the modality imbalance problem. To tackle these issues, we introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals, a multimodal balanced training loss to stabilize the optimization of each modality, and a multimodal sampling guidance to balance the strength of each modality control signal. Comprehensive experimental results demonstrate that MMoT achieves superior performance on both unimodal conditional image synthesis and MCIS tasks with high-quality and faithful image synthesis on complex multimodal conditions. The project website is available at https://jabir-zheng.github.io/MMoT .
doi_str_mv 10.1007/s11263-024-02044-4
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_3097644221</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3097644221</sourcerecordid><originalsourceid>FETCH-LOGICAL-c270t-da43e223fc010336ba2973962e71917d05546ee45bc8ee3c45361455207b4f153</originalsourceid><addsrcrecordid>eNp9kM1Lw0AQxRdRsFb_AU8Bz6uzX9nGmxQ_Cg0ejOc1TSY1tcnW3Q3Y_96tEbx5GN4wvPcYfoRcMrhmAPrGM8ZTQYHLOCAllUdkwpQWlElQx2QCGQeq0oydkjPvNwDAZ1xMyFue2-I2yduvMDiktqG5rcttG_a0sB_Y-6RwZe8b6zp0SZRkbrud9Vgn-bANbXdwx1tft6G1fdwXXbnG5GXfh3f0rT8nJ0259Xjxq1Py-nBfzJ_o8vlxMb9b0oprCLQupUDORVMBAyHSVckzLbKUo2YZ0zUoJVNEqVbVDFFUUomUSaU46JVsmBJTcjX27pz9HNAHs7GDiw95IyDTqZScs-jio6ty1nuHjdm5tivd3jAwB5JmJGkiSfND0sgYEmPIR3O_RvdX_U_qG6qmdYM</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3097644221</pqid></control><display><type>article</type><title>MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis</title><source>SpringerNature Journals</source><creator>Zheng, Jianbin ; Liu, Daqing ; Wang, Chaoyue ; Hu, Minghui ; Yang, Zuopeng ; Ding, Changxing ; Tao, Dacheng</creator><creatorcontrib>Zheng, Jianbin ; Liu, Daqing ; Wang, Chaoyue ; Hu, Minghui ; Yang, Zuopeng ; Ding, Changxing ; Tao, Dacheng</creatorcontrib><description>Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned on any combinations of various modalities that require all of them must be exactly conformed, hindering the synthesis controllability and leaving the potential of cross-modality under-exploited. To this end, we propose to generate images conditioned on the compositions of multimodal control signals, where modalities are imperfectly complementary, i.e., composed multimodal conditional image synthesis (CMCIS). 
Specifically, we observe two challenging issues of the proposed CMCIS task, i.e., the modality coordination problem and the modality imbalance problem. To tackle these issues, we introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals, a multimodal balanced training loss to stabilize the optimization of each modality, and a multimodal sampling guidance to balance the strength of each modality control signal. Comprehensive experimental results demonstrate that MMoT achieves superior performance on both unimodal conditional image synthesis and MCIS tasks with high-quality and faithful image synthesis on complex multimodal conditions. The project website is available at https://jabir-zheng.github.io/MMoT .</description><identifier>ISSN: 0920-5691</identifier><identifier>EISSN: 1573-1405</identifier><identifier>DOI: 10.1007/s11263-024-02044-4</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Artificial Intelligence ; Computer Imaging ; Computer Science ; Computer vision ; Image Processing and Computer Vision ; Image quality ; Mixtures ; Optimization ; Pattern Recognition ; Pattern Recognition and Graphics ; Signal quality ; Synthesis ; Task complexity ; Transformers ; Vision</subject><ispartof>International journal of computer vision, 2024-09, Vol.132 (9), p.3537-3565</ispartof><rights>The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024. Springer Nature or its licensor (e.g. 
a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c270t-da43e223fc010336ba2973962e71917d05546ee45bc8ee3c45361455207b4f153</cites><orcidid>0009-0004-9835-3353</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s11263-024-02044-4$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s11263-024-02044-4$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>315,782,786,27931,27932,41495,42564,51326</link.rule.ids></links><search><creatorcontrib>Zheng, Jianbin</creatorcontrib><creatorcontrib>Liu, Daqing</creatorcontrib><creatorcontrib>Wang, Chaoyue</creatorcontrib><creatorcontrib>Hu, Minghui</creatorcontrib><creatorcontrib>Yang, Zuopeng</creatorcontrib><creatorcontrib>Ding, Changxing</creatorcontrib><creatorcontrib>Tao, Dacheng</creatorcontrib><title>MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis</title><title>International journal of computer vision</title><addtitle>Int J Comput Vis</addtitle><description>Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned on any combinations of various modalities that require all of them must be exactly conformed, hindering the synthesis controllability and leaving the potential of cross-modality under-exploited. 
To this end, we propose to generate images conditioned on the compositions of multimodal control signals, where modalities are imperfectly complementary, i.e., composed multimodal conditional image synthesis (CMCIS). Specifically, we observe two challenging issues of the proposed CMCIS task, i.e., the modality coordination problem and the modality imbalance problem. To tackle these issues, we introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals, a multimodal balanced training loss to stabilize the optimization of each modality, and a multimodal sampling guidance to balance the strength of each modality control signal. Comprehensive experimental results demonstrate that MMoT achieves superior performance on both unimodal conditional image synthesis and MCIS tasks with high-quality and faithful image synthesis on complex multimodal conditions. The project website is available at https://jabir-zheng.github.io/MMoT .</description><subject>Artificial Intelligence</subject><subject>Computer Imaging</subject><subject>Computer Science</subject><subject>Computer vision</subject><subject>Image Processing and Computer Vision</subject><subject>Image quality</subject><subject>Mixtures</subject><subject>Optimization</subject><subject>Pattern Recognition</subject><subject>Pattern Recognition and Graphics</subject><subject>Signal quality</subject><subject>Synthesis</subject><subject>Task 
complexity</subject><subject>Transformers</subject><subject>Vision</subject><issn>0920-5691</issn><issn>1573-1405</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><recordid>eNp9kM1Lw0AQxRdRsFb_AU8Bz6uzX9nGmxQ_Cg0ejOc1TSY1tcnW3Q3Y_96tEbx5GN4wvPcYfoRcMrhmAPrGM8ZTQYHLOCAllUdkwpQWlElQx2QCGQeq0oydkjPvNwDAZ1xMyFue2-I2yduvMDiktqG5rcttG_a0sB_Y-6RwZe8b6zp0SZRkbrud9Vgn-bANbXdwx1tft6G1fdwXXbnG5GXfh3f0rT8nJ0259Xjxq1Py-nBfzJ_o8vlxMb9b0oprCLQupUDORVMBAyHSVckzLbKUo2YZ0zUoJVNEqVbVDFFUUomUSaU46JVsmBJTcjX27pz9HNAHs7GDiw95IyDTqZScs-jio6ty1nuHjdm5tivd3jAwB5JmJGkiSfND0sgYEmPIR3O_RvdX_U_qG6qmdYM</recordid><startdate>20240901</startdate><enddate>20240901</enddate><creator>Zheng, Jianbin</creator><creator>Liu, Daqing</creator><creator>Wang, Chaoyue</creator><creator>Hu, Minghui</creator><creator>Yang, Zuopeng</creator><creator>Ding, Changxing</creator><creator>Tao, Dacheng</creator><general>Springer US</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0009-0004-9835-3353</orcidid></search><sort><creationdate>20240901</creationdate><title>MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis</title><author>Zheng, Jianbin ; Liu, Daqing ; Wang, Chaoyue ; Hu, Minghui ; Yang, Zuopeng ; Ding, Changxing ; Tao, Dacheng</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c270t-da43e223fc010336ba2973962e71917d05546ee45bc8ee3c45361455207b4f153</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Artificial Intelligence</topic><topic>Computer Imaging</topic><topic>Computer Science</topic><topic>Computer vision</topic><topic>Image Processing and Computer Vision</topic><topic>Image 
quality</topic><topic>Mixtures</topic><topic>Optimization</topic><topic>Pattern Recognition</topic><topic>Pattern Recognition and Graphics</topic><topic>Signal quality</topic><topic>Synthesis</topic><topic>Task complexity</topic><topic>Transformers</topic><topic>Vision</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zheng, Jianbin</creatorcontrib><creatorcontrib>Liu, Daqing</creatorcontrib><creatorcontrib>Wang, Chaoyue</creatorcontrib><creatorcontrib>Hu, Minghui</creatorcontrib><creatorcontrib>Yang, Zuopeng</creatorcontrib><creatorcontrib>Ding, Changxing</creatorcontrib><creatorcontrib>Tao, Dacheng</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>International journal of computer vision</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zheng, Jianbin</au><au>Liu, Daqing</au><au>Wang, Chaoyue</au><au>Hu, Minghui</au><au>Yang, Zuopeng</au><au>Ding, Changxing</au><au>Tao, Dacheng</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis</atitle><jtitle>International journal of computer vision</jtitle><stitle>Int J Comput Vis</stitle><date>2024-09-01</date><risdate>2024</risdate><volume>132</volume><issue>9</issue><spage>3537</spage><epage>3565</epage><pages>3537-3565</pages><issn>0920-5691</issn><eissn>1573-1405</eissn><abstract>Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned 
on any combinations of various modalities that require all of them must be exactly conformed, hindering the synthesis controllability and leaving the potential of cross-modality under-exploited. To this end, we propose to generate images conditioned on the compositions of multimodal control signals, where modalities are imperfectly complementary, i.e., composed multimodal conditional image synthesis (CMCIS). Specifically, we observe two challenging issues of the proposed CMCIS task, i.e., the modality coordination problem and the modality imbalance problem. To tackle these issues, we introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals, a multimodal balanced training loss to stabilize the optimization of each modality, and a multimodal sampling guidance to balance the strength of each modality control signal. Comprehensive experimental results demonstrate that MMoT achieves superior performance on both unimodal conditional image synthesis and MCIS tasks with high-quality and faithful image synthesis on complex multimodal conditions. The project website is available at https://jabir-zheng.github.io/MMoT .</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s11263-024-02044-4</doi><tpages>29</tpages><orcidid>https://orcid.org/0009-0004-9835-3353</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 0920-5691
ispartof International journal of computer vision, 2024-09, Vol.132 (9), p.3537-3565
issn 0920-5691
1573-1405
language eng
recordid cdi_proquest_journals_3097644221
source SpringerNature Journals
subjects Artificial Intelligence
Computer Imaging
Computer Science
Computer vision
Image Processing and Computer Vision
Image quality
Mixtures
Optimization
Pattern Recognition
Pattern Recognition and Graphics
Signal quality
Synthesis
Task complexity
Transformers
Vision
title MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-04T12%3A02%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MMoT:%20Mixture-of-Modality-Tokens%20Transformer%20for%20Composed%20Multimodal%20Conditional%20Image%20Synthesis&rft.jtitle=International%20journal%20of%20computer%20vision&rft.au=Zheng,%20Jianbin&rft.date=2024-09-01&rft.volume=132&rft.issue=9&rft.spage=3537&rft.epage=3565&rft.pages=3537-3565&rft.issn=0920-5691&rft.eissn=1573-1405&rft_id=info:doi/10.1007/s11263-024-02044-4&rft_dat=%3Cproquest_cross%3E3097644221%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3097644221&rft_id=info:pmid/&rfr_iscdi=true