Sound of Story: Multi-modal Storytelling with Audio
Saved in:
Main authors: | Bae, Jaeyeon; Jeong, Seokhoon; Kang, Seokun; Han, Namgi; Lee, Jae-Yon; Kim, Hyounghun; Kim, Taehwan |
---|---|
Format: | Article |
Language: | English |
Subjects: | Computer Science - Multimedia; Computer Science - Sound |
Online access: | Order full text |
creator | Bae, Jaeyeon; Jeong, Seokhoon; Kang, Seokun; Han, Namgi; Lee, Jae-Yon; Kim, Hyounghun; Kim, Taehwan |
description | Storytelling is multi-modal in the real world: when one tells a story, one may use visualizations and sounds along with the story itself. However, prior studies on storytelling datasets and tasks have paid little attention to sound, even though sound also conveys meaningful semantics of the story. We therefore propose to extend the areas of story understanding and storytelling by establishing a new component called "background sound": story-context-based audio without any linguistic information. For this purpose, we introduce a new dataset, called "Sound of Story (SoS)", which pairs image and text sequences with corresponding sound or background music for each story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound. The SoS dataset consists of 27,354 stories with an average of 19.6 images per story and 984 hours of speech-decoupled audio, such as background music and other sounds. As benchmark tasks for storytelling with sound on this dataset, we propose retrieval tasks between modalities and audio generation tasks from image-text sequences, and we introduce strong baselines for them. We believe the proposed dataset and tasks may shed light on the multi-modal understanding of storytelling in terms of sound. The dataset and baseline code for each task will be released at https://github.com/Sosdatasets/SoS_Dataset. |
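The retrieval benchmark outlined above matches audio clips against image-text story sequences across modalities. As a minimal sketch of what one such cross-modal retrieval step could look like, the snippet below ranks story embeddings by cosine similarity to each audio embedding; the embedding dimensions and random vectors are hypothetical stand-ins, not the authors' baseline code (see the linked repository for that).

```python
# Hypothetical sketch of cross-modal retrieval: rank image-text story
# embeddings by cosine similarity to each audio-clip embedding.
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


def retrieve_stories(audio_emb: np.ndarray, story_emb: np.ndarray,
                     k: int = 5) -> np.ndarray:
    """For each audio embedding, return indices of the top-k most
    similar image-text story embeddings."""
    sims = cosine_similarity_matrix(audio_emb, story_emb)
    return np.argsort(-sims, axis=1)[:, :k]


# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(8, 512))    # 8 audio clips
story_emb = rng.normal(size=(100, 512))  # 100 story sequences
print(retrieve_stories(audio_emb, story_emb, k=3))
```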
format | Article |
creationdate | 2023-10-30 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 (open access, free to read) |
identifier | DOI: 10.48550/arxiv.2310.19264 |
language | eng |
recordid | cdi_arxiv_primary_2310_19264 |
source | arXiv.org |
subjects | Computer Science - Multimedia; Computer Science - Sound |
title | Sound of Story: Multi-modal Storytelling with Audio |
url | https://arxiv.org/abs/2310.19264 |