Temporal video scene segmentation using deep-learning
Automatic temporal video scene segmentation (also known as video story segmentation) is still an open problem without definitive solutions in most cases. Among the available techniques, those showing the best results are multimodal, using features extracted from multiple modalities.
Saved in:
Published in: | Multimedia tools and applications 2021-05, Vol.80 (12), p.17487-17513 |
---|---|
Main authors: | Trojahn, Tiago Henrique; Goularte, Rudinei |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 17513 |
---|---|
container_issue | 12 |
container_start_page | 17487 |
container_title | Multimedia tools and applications |
container_volume | 80 |
creator | Trojahn, Tiago Henrique; Goularte, Rudinei |
description | The automatic temporal video scene segmentation (also known as video story segmentation) is still an open problem without definitive solutions in most cases. Among the available techniques, those which show the best results are multimodal, using features extracted from multiple modalities. Multimodal fusion may be performed by fusing all modalities into a single representation (early fusion) or by combining per-modality segmentations (late fusion), the latter being widely used due to its simplicity. Recently, deep learning techniques such as convolutional neural networks (CNNs) have been successfully employed to extract features from multiple data sources, easing the development of early fusion methods. However, CNNs cannot adequately learn cues which are temporally distributed along the video, due to difficulties in modeling temporal dependencies among feature data. A particular deep learning approach which can learn such cues is the recurrent neural network (RNN). Successfully employed in text processing, RNNs are suited to analyzing sequences of variable length and may better grasp the temporal relationship among low-level features of video segments, potentially yielding more accurate scene boundary detection. This paper goes beyond directly applying RNNs and proposes a new multimodal approach to temporally segment a video into scenes. The approach builds a new architecture carefully combining CNN and RNN capabilities, obtaining better efficacy on the task than related techniques on a public video dataset. |
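The early-versus-late fusion distinction in the abstract above can be sketched with toy features. This is an illustrative sketch only, not the paper's method: the function names, feature dimensions, and the 0.5 weighting are assumptions for demonstration.

```python
import numpy as np

def early_fusion(visual, audio):
    # Early fusion: concatenate per-shot feature vectors from each
    # modality into one joint representation before any classifier runs.
    return np.concatenate([visual, audio], axis=-1)

def late_fusion(visual_scores, audio_scores, w=0.5):
    # Late fusion: each modality is segmented independently, and the
    # resulting per-boundary scores are combined afterwards
    # (here, a simple weighted average).
    return w * np.asarray(visual_scores) + (1 - w) * np.asarray(audio_scores)

# Toy per-shot features: 4 shots, 3-dim visual and 2-dim audio descriptors.
visual = np.ones((4, 3))
audio = np.zeros((4, 2))
fused = early_fusion(visual, audio)
print(fused.shape)  # (4, 5)

# Toy per-boundary scene-change scores from each modality.
v_scores = np.array([0.9, 0.1, 0.8, 0.2])
a_scores = np.array([0.7, 0.3, 0.6, 0.4])
print(late_fusion(v_scores, a_scores))
```

As the abstract notes, early fusion hands the classifier a richer joint input, while late fusion only needs each modality's scores, which is why it is the simpler and more widely used option.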
doi_str_mv | 10.1007/s11042-020-10450-2 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 1380-7501 |
ispartof | Multimedia tools and applications, 2021-05, Vol.80 (12), p.17487-17513 |
issn | 1380-7501 1573-7721 |
language | eng |
recordid | cdi_proquest_journals_2529604495 |
source | SpringerLink Journals - AutoHoldings |
subjects | Artificial neural networks; Computer Communication Networks; Computer Science; Data Structures and Information Theory; Deep learning; Feature extraction; Multimedia Information Systems; Neural networks; Recurrent neural networks; Segmentation; Segments; Special Purpose and Application-Based Systems |
title | Temporal video scene segmentation using deep-learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T00%3A29%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Temporal%20video%20scene%20segmentation%20using%20deep-learning&rft.jtitle=Multimedia%20tools%20and%20applications&rft.au=Trojahn,%20Tiago%20Henrique&rft.date=2021-05-01&rft.volume=80&rft.issue=12&rft.spage=17487&rft.epage=17513&rft.pages=17487-17513&rft.issn=1380-7501&rft.eissn=1573-7721&rft_id=info:doi/10.1007/s11042-020-10450-2&rft_dat=%3Cproquest_cross%3E2529604495%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2529604495&rft_id=info:pmid/&rfr_iscdi=true |