Temporal video scene segmentation using deep-learning


Detailed description

Saved in:
Bibliographic details
Published in: Multimedia tools and applications, 2021-05, Vol. 80 (12), p. 17487-17513
Main authors: Trojahn, Tiago Henrique; Goularte, Rudinei
Format: Article
Language: English
Subjects:
Online access: Full text
description Automatic temporal video scene segmentation (also known as video story segmentation) is still an open problem without a definitive solution in most cases. Among the available techniques, the ones that show the best results are multimodal, using features extracted from multiple modalities. Multimodal fusion may be performed by fusing all modalities into a single representation (early fusion) or by combining the segmentation produced for each modality (late fusion), the latter being widely used due to its simplicity. Recently, deep learning techniques such as convolutional neural networks (CNNs) have been successfully employed to extract features from multiple data sources, easing the development of early fusion methods. However, CNNs cannot adequately learn cues that are temporally distributed along the video because of difficulties in modeling the temporal dependencies among features. A particular deep learning approach that can learn such cues is the recurrent neural network (RNN). Successfully employed in text processing, RNNs are suited to analyzing sequences of variable length and may better capture the temporal relationships among low-level features of video segments, potentially yielding more accurate scene boundary detection. This paper goes beyond directly applying RNNs and proposes a new multimodal approach to temporally segmenting a video into scenes. The approach builds a new architecture that carefully combines CNN and RNN capabilities, obtaining better efficacy on the task when compared with related techniques on a public video dataset.
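The description contrasts early fusion (concatenating per-modality features into a single representation before classification) with late fusion (segmenting each modality independently and combining the per-modality decisions). A minimal sketch of the two strategies, using hypothetical per-shot feature lists and a simple score average in place of the paper's actual combiner:

```python
def early_fusion(visual, audio):
    # Early fusion: concatenate per-modality feature vectors into a
    # single representation for each shot before any classification.
    return [v + a for v, a in zip(visual, audio)]

def late_fusion(visual_scores, audio_scores, threshold=0.5):
    # Late fusion: each modality is segmented independently; the
    # per-modality boundary scores are then combined (here: averaged)
    # and thresholded into boundary decisions.
    return [(v + a) / 2.0 > threshold
            for v, a in zip(visual_scores, audio_scores)]

# Hypothetical per-shot features: 3 shots, 4-dim visual, 2-dim audio.
visual = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
audio = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
fused = early_fusion(visual, audio)   # each shot now has 6 features

boundaries = late_fusion([0.9, 0.2, 0.7], [0.8, 0.3, 0.2])
# boundaries -> [True, False, False]
```

The dimensions, scores, and averaging rule are illustrative only; the paper's architecture performs the fusion with learned CNN/RNN components.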
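The description's central claim — that a recurrent network over per-shot features can capture temporally distributed cues a per-shot extractor misses — can be illustrated with a toy single-unit RNN in pure Python. The weights and 1-D features are arbitrary, untrained placeholders, not the authors' model:

```python
import math

def rnn_boundary_scores(features, w_in=2.0, w_rec=1.0, w_out=3.0):
    # Toy single-hidden-unit recurrent network: the hidden state h
    # carries information from earlier shots forward, which a purely
    # per-shot (CNN-style) feature extractor cannot do on its own.
    h = 0.0
    scores = []
    for x in features:
        h = math.tanh(w_in * x + w_rec * h)                # recurrent update
        scores.append(1.0 / (1.0 + math.exp(-w_out * h)))  # boundary prob.
    return scores

# Hypothetical 1-D per-shot features, fed in two different orders.
forward = rnn_boundary_scores([0.2, 0.9, 0.1, 0.8])
backward = rnn_boundary_scores([0.8, 0.1, 0.9, 0.2])
# The same feature value (0.8) gets a different score depending on
# its temporal context: the model is order-sensitive.
```

This order sensitivity is exactly the property the paper exploits when it stacks an RNN on top of CNN-extracted features to detect scene boundaries.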
doi 10.1007/s11042-020-10450-2
identifier ISSN: 1380-7501
identifier EISSN: 1573-7721
source SpringerLink Journals - AutoHoldings
subjects Artificial neural networks
Computer Communication Networks
Computer Science
Data Structures and Information Theory
Deep learning
Feature extraction
Multimedia Information Systems
Neural networks
Recurrent neural networks
Segmentation
Segments
Special Purpose and Application-Based Systems
title Temporal video scene segmentation using deep-learning