Capsule Boundary Network With 3D Convolutional Dynamic Routing for Temporal Action Detection

Bibliographic details
Published in:	IEEE transactions on circuits and systems for video technology 2022-05, Vol.32 (5), p.2962-2975
Main authors:	Chen, Yaosen; Guo, Bing; Shen, Yan; Wang, Wei; Lu, Weichen; Suo, Xinhua
Format:	Article
Language:	eng
Subjects:	Algorithms; Artificial neural networks; capsule network; Feature extraction; Feature maps; Heuristic algorithms; Modules; Optical flow (image analysis); Proposals; Routing; Task analysis; Temporal action detection; temporal action proposals; Tensors; Three-dimensional displays; video features
Online access:	Order full text

container_end_page	2975
container_issue	5
container_start_page	2962
container_title	IEEE transactions on circuits and systems for video technology
container_volume	32
creator	Chen, Yaosen; Guo, Bing; Shen, Yan; Wang, Wei; Lu, Weichen; Suo, Xinhua
description	Temporal action detection is a challenging task in video understanding: complex backgrounds and rich action content make it hard to generate high-quality temporal proposals in untrimmed videos. Capsule networks avoid some limitations of convolutional neural networks, such as the invariance caused by pooling, and can therefore better model the temporal relations needed for temporal action detection. However, because their computation is extremely expensive, capsule networks are difficult to apply to this task. To address this issue, this paper proposes a novel U-shaped capsule network framework with a k-Nearest Neighbor (k-NN) mechanism of 3D convolutional dynamic routing, which we name U-BlockConvCaps. Furthermore, we build a Capsules Boundary Network (CapsBoundNet) based on U-BlockConvCaps for dense temporal action proposal generation. Specifically, the first module is a single 1D convolutional layer that fuses the two-stream RGB and optical flow video features. The sampling module further processes the fused features to generate 2D start-end action proposal feature maps. Then, a multi-scale U-Block convolutional capsule module with 3D convolutional dynamic routing processes the proposal feature maps. Finally, the feature maps generated by CapsBoundNet are used to predict starting, ending, action classification, and action regression score maps, which help to capture boundary and intersection-over-union features. Our work improves the dynamic routing algorithm of capsule networks and extends capsule networks to the temporal action detection task for the first time in the literature. Experimental results on the THUMOS14 benchmark show that CapsBoundNet clearly outperforms state-of-the-art methods: mAP at tIoU = 0.3, 0.4, and 0.5 improves from 63.6% to 70.0%, 57.8% to 63.1%, and 51.3% to 52.9%, respectively. We also obtain competitive results on the ActivityNet1.3 action detection dataset.
doi_str_mv	10.1109/TCSVT.2021.3104226
format	Article
sourceid	proquest_RIE
ieee_id	9512048
pqid	2659345409
stitle	TCSVT
coden	ITCTEM
publisher	New York: IEEE
rights	Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022
creationdate	2022-05-01
tpages	14
genre	article
ristype	JOUR
toplevel	peer_reviewed; online_resources
delcategory	Remote Search Resource
orcid	https://orcid.org/0000-0002-0679-4601; https://orcid.org/0000-0001-8141-8430; https://orcid.org/0000-0002-7212-1755
linktohtml	https://ieeexplore.ieee.org/document/9512048
collection	IEEE All-Society Periodicals Package (ASPP) 2005-present; IEEE All-Society Periodicals Package (ASPP) 1998-Present; IEEE Electronic Library (IEL); CrossRef; Computer and Information Systems Abstracts; Electronics & Communications Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Academic; Computer and Information Systems Abstracts Professional
fulltext	fulltext_linktorsrc
identifier	ISSN: 1051-8215
ispartof	IEEE transactions on circuits and systems for video technology, 2022-05, Vol.32 (5), p.2962-2975
issn	1051-8215 1558-2205
language	eng
recordid	cdi_crossref_primary_10_1109_TCSVT_2021_3104226
source	IEEE Electronic Library (IEL)
subjects	Algorithms; Artificial neural networks; capsule network; Feature extraction; Feature maps; Heuristic algorithms; Modules; Optical flow (image analysis); Proposals; Routing; Task analysis; Temporal action detection; temporal action proposals; Tensors; Three-dimensional displays; video features
title	Capsule Boundary Network With 3D Convolutional Dynamic Routing for Temporal Action Detection
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-31T12%3A15%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Capsule%20Boundary%20Network%20With%203D%20Convolutional%20Dynamic%20Routing%20for%20Temporal%20Action%20Detection&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Chen,%20Yaosen&rft.date=2022-05-01&rft.volume=32&rft.issue=5&rft.spage=2962&rft.epage=2975&rft.pages=2962-2975&rft.issn=1051-8215&rft.eissn=1558-2205&rft.coden=ITCTEM&rft_id=info:doi/10.1109/TCSVT.2021.3104226&rft_dat=%3Cproquest_RIE%3E2659345409%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2659345409&rft_id=info:pmid/&rft_ieee_id=9512048&rfr_iscdi=true