Capsule Boundary Network With 3D Convolutional Dynamic Routing for Temporal Action Detection

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2022-05, Vol. 32 (5), pp. 2962-2975
Main authors: Chen, Yaosen; Guo, Bing; Shen, Yan; Wang, Wei; Lu, Weichen; Suo, Xinhua
Format: Article
Language: English
container_end_page 2975
container_issue 5
container_start_page 2962
container_title IEEE transactions on circuits and systems for video technology
container_volume 32
creator Chen, Yaosen
Guo, Bing
Shen, Yan
Wang, Wei
Lu, Weichen
Suo, Xinhua
description Temporal action detection is a challenging video-understanding task because complex backgrounds and rich action content make it hard to generate high-quality temporal proposals in untrimmed videos. Capsule networks avoid some limitations of convolutional neural networks, such as the invariance introduced by pooling, and can better model the temporal relations that temporal action detection depends on. However, their routing procedure is extremely computationally expensive, which has made capsule networks difficult to apply to this task. To address this issue, this paper proposes U-BlockConvCaps, a novel U-shaped capsule network framework whose 3D convolutional dynamic routing uses a k-nearest-neighbor (k-NN) mechanism. On top of U-BlockConvCaps, we build the Capsule Boundary Network (CapsBoundNet) for dense temporal action proposal generation. Specifically, a 1D convolutional layer first fuses the two-stream RGB and optical-flow video features. A sampling module then processes the fused features to generate 2D start-end action proposal feature maps, which a multi-scale U-Block convolutional capsule module with 3D convolutional dynamic routing processes further. Finally, the feature maps produced by CapsBoundNet are used to predict starting, ending, action classification, and action regression score maps, which help capture boundary and intersection-over-union (IoU) features. Our work improves the dynamic routing algorithm of capsule networks and extends capsule networks to the temporal action detection task for the first time in the literature.
Experiments on the THUMOS14 benchmark show that CapsBoundNet clearly outperforms state-of-the-art methods: mAP at tIoU = 0.3, 0.4, and 0.5 improves from 63.6% to 70.0%, from 57.8% to 63.1%, and from 51.3% to 52.9%, respectively. CapsBoundNet also achieves competitive results on the ActivityNet1.3 action detection dataset.
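The abstract reports mAP at several temporal IoU (tIoU) thresholds and says the 2D start-end proposal map feeds score maps that capture IoU information. As a rough illustration only, the sketch below computes tIoU between two segments and fills a start-end map in which cell (s, e) holds the best tIoU of the candidate proposal [s, e] against the ground truth. The map layout (rows = start index, columns = end index, cells with e <= s left at zero) and the names `temporal_iou` and `iou_target_map` are hypothetical, borrowed from the general style of boundary-matching proposal methods; this is not CapsBoundNet's published implementation.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in temporal steps, start < end


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union (tIoU) of two 1D segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def iou_target_map(num_steps: int, gts: Sequence[Segment]) -> List[List[float]]:
    """Hypothetical 2D start-end IoU map (assumed layout, not the paper's code).

    Row index s is the proposal start and column index e the proposal end.
    Only cells with e > s describe a valid proposal [s, e]; all others stay 0.
    Each valid cell holds the best tIoU against any ground-truth segment.
    """
    grid = [[0.0] * num_steps for _ in range(num_steps)]
    for s in range(num_steps):
        for e in range(s + 1, num_steps):
            grid[s][e] = max((temporal_iou((s, e), g) for g in gts), default=0.0)
    return grid


if __name__ == "__main__":
    gt = [(1.0, 3.0)]
    m = iou_target_map(4, gt)
    print(m[1][3])            # 1.0: proposal [1, 3] matches the ground truth exactly
    print(round(m[0][3], 3))  # 0.667: overlap 2 over union 3
```

Under the usual mAP@tIoU protocol, a detected segment counts as a true positive at threshold 0.5 only if its tIoU with a not-yet-matched ground-truth segment of the same class is at least 0.5, which is why the reported gains shrink as the threshold rises.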
doi_str_mv 10.1109/TCSVT.2021.3104226
format Article
publisher New York: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022
date 2022-05-01
coden ITCTEM
tpages 14
toplevel peer_reviewed
orcidid https://orcid.org/0000-0002-0679-4601
https://orcid.org/0000-0001-8141-8430
https://orcid.org/0000-0002-7212-1755
ieee_id 9512048
linktorsrc https://ieeexplore.ieee.org/document/9512048
fulltext fulltext_linktorsrc
identifier ISSN: 1051-8215
ispartof IEEE transactions on circuits and systems for video technology, 2022-05, Vol.32 (5), p.2962-2975
issn 1051-8215
1558-2205
language eng
recordid cdi_crossref_primary_10_1109_TCSVT_2021_3104226
source IEEE Electronic Library (IEL)
subjects Algorithms
Artificial neural networks
capsule network
Feature extraction
Feature maps
Heuristic algorithms
Modules
Optical flow (image analysis)
Proposals
Routing
Task analysis
Temporal action detection
temporal action proposals
Tensors
Three-dimensional displays
video features
title Capsule Boundary Network With 3D Convolutional Dynamic Routing for Temporal Action Detection
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-31T12%3A15%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Capsule%20Boundary%20Network%20With%203D%20Convolutional%20Dynamic%20Routing%20for%20Temporal%20Action%20Detection&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Chen,%20Yaosen&rft.date=2022-05-01&rft.volume=32&rft.issue=5&rft.spage=2962&rft.epage=2975&rft.pages=2962-2975&rft.issn=1051-8215&rft.eissn=1558-2205&rft.coden=ITCTEM&rft_id=info:doi/10.1109/TCSVT.2021.3104226&rft_dat=%3Cproquest_RIE%3E2659345409%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2659345409&rft_id=info:pmid/&rft_ieee_id=9512048&rfr_iscdi=true