BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involvi...
Main authors: | Kim, Gwanghyun ; Kim, Hayeon ; Seo, Hoigi ; Kang, Dong Un ; Chun, Se Young |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition |

container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Kim, Gwanghyun ; Kim, Hayeon ; Seo, Hoigi ; Kang, Dong Un ; Chun, Se Young |
description | Generating higher-resolution human-centric scenes with details and controls
remains a challenge for existing text-to-image diffusion models. This challenge
stems from limited training image size, text encoder capacity (limited tokens),
and the inherent difficulty of generating complex scenes involving multiple
humans. While current methods have attempted to address only the training size
limit, they often yield human-centric scenes with severe artifacts. We propose
BeyondScene, a novel framework that overcomes prior limitations, generating
exquisite higher-resolution (over 8K) human-centric scenes with exceptional
text-image correspondence and naturalness using existing pretrained diffusion
models. BeyondScene employs a staged and hierarchical approach to initially
generate a detailed base image focusing on crucial elements in instance
creation for multiple humans and detailed descriptions beyond the token limit of
the diffusion model, and then to seamlessly convert the base image to a
higher-resolution output, exceeding training image size and incorporating
details aware of text and instances via our novel instance-aware hierarchical
enlargement process that consists of our proposed high-frequency injected
forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing
methods in terms of correspondence with detailed text descriptions and
naturalness, paving the way for advanced applications in higher-resolution
human-centric scene creation beyond the capacity of pretrained diffusion models
without costly retraining. Project page:
https://janeyeon.github.io/beyond-scene. |
doi_str_mv | 10.48550/arxiv.2404.04544 |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2404_04544</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2404_04544</sourcerecordid><originalsourceid>FETCH-LOGICAL-a674-5ec2e4df65122fdb225e6ad8616c4fc83480a0079070a6a54fcd5b56e9c932123</originalsourceid><addsrcrecordid>eNotj8FOwzAQRH3hgAofwAn_QILjrJ2EGwRokCqB2ko9Rlt7TS21DnISRP-eELjMSDOjkR5jN5lIoVRK3GH89l-pBAGpAAVwyXaPdO6C3RgKdM8b_3GgmKyp747j4LvAm_GEIakpDNEbPs_4cpKIc73zw4G_Rxoi-kCWP3nnxn5qrtiFw2NP1_--YNuX523dJKu35Wv9sEpQF5AoMpLAOq0yKZ3dS6lIoy11pg04U-ZQChSiqEQhUKOaMqv2SlNlqlxmMl-w27_bmaz9jP6E8dz-ErYzYf4DsJ5L-Q</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion</title><source>arXiv.org</source><creator>Kim, Gwanghyun ; Kim, Hayeon ; Seo, Hoigi ; Kang, Dong Un ; Chun, Se Young</creator><creatorcontrib>Kim, Gwanghyun ; Kim, Hayeon ; Seo, Hoigi ; Kang, Dong Un ; Chun, Se Young</creatorcontrib><identifier>DOI: 10.48550/arxiv.2404.04544</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2024-04</creationdate><rights>http://creativecommons.org/licenses/by-nc-sa/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,782,887</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2404.04544$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2404.04544$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Kim, Gwanghyun</creatorcontrib><creatorcontrib>Kim, Hayeon</creatorcontrib><creatorcontrib>Seo, Hoigi</creatorcontrib><creatorcontrib>Kang, Dong Un</creatorcontrib><creatorcontrib>Chun, Se Young</creatorcontrib><title>BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion</title><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8FOwzAQRH3hgAofwAn_QILjrJ2EGwRokCqB2ko9Rlt7TS21DnISRP-eELjMSDOjkR5jN5lIoVRK3GH89l-pBAGpAAVwyXaPdO6C3RgKdM8b_3GgmKyp747j4LvAm_GEIakpDNEbPs_4cpKIc73zw4G_Rxoi-kCWP3nnxn5qrtiFw2NP1_--YNuX523dJKu35Wv9sEpQF5AoMpLAOq0yKZ3dS6lIoy11pg04U-ZQChSiqEQhUKOaMqv2SlNlqlxmMl-w27_bmaz9jP6E8dz-ErYzYf4DsJ5L-Q</recordid><startdate>20240406</startdate><enddate>20240406</enddate><creator>Kim, Gwanghyun</creator><creator>Kim, Hayeon</creator><creator>Seo, Hoigi</creator><creator>Kang, Dong Un</creator><creator>Chun, Se Young</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240406</creationdate><title>BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion</title><author>Kim, Gwanghyun ; Kim, Hayeon ; Seo, Hoigi ; Kang, Dong Un ; Chun, Se Young</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a674-5ec2e4df65122fdb225e6ad8616c4fc83480a0079070a6a54fcd5b56e9c932123</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Kim, Gwanghyun</creatorcontrib><creatorcontrib>Kim, Hayeon</creatorcontrib><creatorcontrib>Seo, Hoigi</creatorcontrib><creatorcontrib>Kang, Dong Un</creatorcontrib><creatorcontrib>Chun, Se Young</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Kim, Gwanghyun</au><au>Kim, Hayeon</au><au>Seo, Hoigi</au><au>Kang, Dong Un</au><au>Chun, Se Young</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion</atitle><date>2024-04-06</date><risdate>2024</risdate><doi>10.48550/arxiv.2404.04544</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2404.04544 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2404_04544 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition |
title | BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-03T06%3A33%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=BeyondScene:%20Higher-Resolution%20Human-Centric%20Scene%20Generation%20With%20Pretrained%20Diffusion&rft.au=Kim,%20Gwanghyun&rft.date=2024-04-06&rft_id=info:doi/10.48550/arxiv.2404.04544&rft_dat=%3Carxiv_GOX%3E2404_04544%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |