BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involvi...

Bibliographic details
Main authors:	Kim, Gwanghyun; Kim, Hayeon; Seo, Hoigi; Kang, Dong Un; Chun, Se Young
Format:	Article
Language:	eng
Subjects:	Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Kim, Gwanghyun ; Kim, Hayeon ; Seo, Hoigi ; Kang, Dong Un ; Chun, Se Young
description	Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods attempted to address training size limit only, they often yielded human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach to initially generate a detailed base image focusing on crucial elements in instance creation for multiple humans and detailed descriptions beyond token limit of diffusion model, and then to seamlessly convert the base image to a higher-resolution output, exceeding training image size and incorporating details aware of text and instances via our novel instance-aware hierarchical enlargement process that consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. Project page: https://janeyeon.github.io/beyond-scene.
doi_str_mv	10.48550/arxiv.2404.04544
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2404_04544</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2404_04544</sourcerecordid><originalsourceid>FETCH-LOGICAL-a674-5ec2e4df65122fdb225e6ad8616c4fc83480a0079070a6a54fcd5b56e9c932123</originalsourceid><addsrcrecordid>eNotj8FOwzAQRH3hgAofwAn_QILjrJ2EGwRokCqB2ko9Rlt7TS21DnISRP-eELjMSDOjkR5jN5lIoVRK3GH89l-pBAGpAAVwyXaPdO6C3RgKdM8b_3GgmKyp747j4LvAm_GEIakpDNEbPs_4cpKIc73zw4G_Rxoi-kCWP3nnxn5qrtiFw2NP1_--YNuX523dJKu35Wv9sEpQF5AoMpLAOq0yKZ3dS6lIoy11pg04U-ZQChSiqEQhUKOaMqv2SlNlqlxmMl-w27_bmaz9jP6E8dz-ErYzYf4DsJ5L-Q</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion</title><source>arXiv.org</source><creator>Kim, Gwanghyun ; Kim, Hayeon ; Seo, Hoigi ; Kang, Dong Un ; Chun, Se Young</creator><creatorcontrib>Kim, Gwanghyun ; Kim, Hayeon ; Seo, Hoigi ; Kang, Dong Un ; Chun, Se Young</creatorcontrib><description>Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods attempted to address training size limit only, they often yielded human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. 
BeyondScene employs a staged and hierarchical approach to initially generate a detailed base image focusing on crucial elements in instance creation for multiple humans and detailed descriptions beyond token limit of diffusion model, and then to seamlessly convert the base image to a higher-resolution output, exceeding training image size and incorporating details aware of text and instances via our novel instance-aware hierarchical enlargement process that consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. Project page: https://janeyeon.github.io/beyond-scene.</description><identifier>DOI: 10.48550/arxiv.2404.04544</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2024-04</creationdate><rights>http://creativecommons.org/licenses/by-nc-sa/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,782,887</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2404.04544$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2404.04544$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Kim, Gwanghyun</creatorcontrib><creatorcontrib>Kim, Hayeon</creatorcontrib><creatorcontrib>Seo, Hoigi</creatorcontrib><creatorcontrib>Kang, Dong Un</creatorcontrib><creatorcontrib>Chun, Se 
Young</creatorcontrib><title>BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion</title><description>Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods attempted to address training size limit only, they often yielded human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach to initially generate a detailed base image focusing on crucial elements in instance creation for multiple humans and detailed descriptions beyond token limit of diffusion model, and then to seamlessly convert the base image to a higher-resolution output, exceeding training image size and incorporating details aware of text and instances via our novel instance-aware hierarchical enlargement process that consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. 
Project page: https://janeyeon.github.io/beyond-scene.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8FOwzAQRH3hgAofwAn_QILjrJ2EGwRokCqB2ko9Rlt7TS21DnISRP-eELjMSDOjkR5jN5lIoVRK3GH89l-pBAGpAAVwyXaPdO6C3RgKdM8b_3GgmKyp747j4LvAm_GEIakpDNEbPs_4cpKIc73zw4G_Rxoi-kCWP3nnxn5qrtiFw2NP1_--YNuX523dJKu35Wv9sEpQF5AoMpLAOq0yKZ3dS6lIoy11pg04U-ZQChSiqEQhUKOaMqv2SlNlqlxmMl-w27_bmaz9jP6E8dz-ErYzYf4DsJ5L-Q</recordid><startdate>20240406</startdate><enddate>20240406</enddate><creator>Kim, Gwanghyun</creator><creator>Kim, Hayeon</creator><creator>Seo, Hoigi</creator><creator>Kang, Dong Un</creator><creator>Chun, Se Young</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240406</creationdate><title>BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion</title><author>Kim, Gwanghyun ; Kim, Hayeon ; Seo, Hoigi ; Kang, Dong Un ; Chun, Se Young</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a674-5ec2e4df65122fdb225e6ad8616c4fc83480a0079070a6a54fcd5b56e9c932123</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Kim, Gwanghyun</creatorcontrib><creatorcontrib>Kim, Hayeon</creatorcontrib><creatorcontrib>Seo, Hoigi</creatorcontrib><creatorcontrib>Kang, Dong Un</creatorcontrib><creatorcontrib>Chun, Se Young</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Kim, Gwanghyun</au><au>Kim, Hayeon</au><au>Seo, Hoigi</au><au>Kang, Dong Un</au><au>Chun, Se Young</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion</atitle><date>2024-04-06</date><risdate>2024</risdate><abstract>Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods attempted to address training size limit only, they often yielded human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach to initially generate a detailed base image focusing on crucial elements in instance creation for multiple humans and detailed descriptions beyond token limit of diffusion model, and then to seamlessly convert the base image to a higher-resolution output, exceeding training image size and incorporating details aware of text and instances via our novel instance-aware hierarchical enlargement process that consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. 
Project page: https://janeyeon.github.io/beyond-scene.</abstract><doi>10.48550/arxiv.2404.04544</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2404.04544
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2404_04544
source	arXiv.org
subjects	Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition
title	BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-03T06%3A33%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=BeyondScene:%20Higher-Resolution%20Human-Centric%20Scene%20Generation%20With%20Pretrained%20Diffusion&rft.au=Kim,%20Gwanghyun&rft.date=2024-04-06&rft_id=info:doi/10.48550/arxiv.2404.04544&rft_dat=%3Carxiv_GOX%3E2404_04544%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true