Improving Generation and Evaluation of Visual Stories via Semantic Consistency

Story visualization is an under-explored task that falls at the intersection of many important research directions in both computer vision and natural language processing. In this task, given a series of natural language captions which compose a story, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. However, there is room for improvement of generated images in terms of visual quality, coherence and relevance.
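The dual-learning idea mentioned in the abstract — using video captioning as the inverse task to reinforce semantic alignment between the story and the generated frames — can be sketched abstractly as a consistency loss. The following is a minimal, hypothetical illustration, not the paper's implementation; the function names and the choice of cosine distance are assumptions made for clarity:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dual_consistency_loss(caption_embs, recaption_embs):
    """Average (1 - cosine) distance between each story caption's
    embedding and the embedding of the caption re-generated from the
    corresponding synthesized frame by a captioning model (the dual
    task). Lower loss means the frames better preserve the story's
    semantics. Embeddings here are plain lists of floats."""
    assert len(caption_embs) == len(recaption_embs)
    losses = [1.0 - cosine(c, r) for c, r in zip(caption_embs, recaption_embs)]
    return sum(losses) / len(losses)

# Identical embeddings incur zero loss; orthogonal ones incur loss 1.
print(dual_consistency_loss([[1.0, 0.0]], [[1.0, 0.0]]))  # 0.0
print(dual_consistency_loss([[1.0, 0.0]], [[0.0, 1.0]]))  # 1.0
```

In a real training loop this term would be added to the generator's objective, so gradients from the captioning pass push the image generator toward frames whose content can be captioned back to the original story.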

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Maharana, Adyasha, Hannan, Darryl, Bansal, Mohit
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Maharana, Adyasha
Hannan, Darryl
Bansal, Mohit
description Story visualization is an under-explored task that falls at the intersection of many important research directions in both computer vision and natural language processing. In this task, given a series of natural language captions which compose a story, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. However, there is room for improvement of generated images in terms of visual quality, coherence and relevance. We present a number of improvements to prior modeling approaches, including (1) the addition of a dual learning framework that utilizes video captioning to reinforce the semantic alignment between the story and generated images, (2) a copy-transform mechanism for sequentially-consistent story visualization, and (3) MART-based transformers to model complex interactions between frames. We present ablation studies to demonstrate the effect of each of these techniques on the generative power of the model for both individual images as well as the entire narrative. Furthermore, due to the complexity and generative nature of the task, standard evaluation metrics do not accurately reflect performance. Therefore, we also provide an exploration of evaluation metrics for the model, focused on aspects of the generated frames such as the presence/quality of generated characters, the relevance to captions, and the diversity of the generated images. We also present correlation experiments of our proposed automated metrics with human evaluations. Code and data available at: https://github.com/adymaharana/StoryViz
doi_str_mv 10.48550/arxiv.2105.10026
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2105.10026
language eng
recordid cdi_arxiv_primary_2105_10026
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
title Improving Generation and Evaluation of Visual Stories via Semantic Consistency