Weakly Supervised Temporal Action Detection With Temporal Dependency Learning

Weakly supervised temporal action detection aims at localizing temporal positions of action instances in untrimmed videos with only action class labels. In general, previous methods individually classify each frame based on appearance information and short-term motion information, and then integrate consecutive high-response action frames into entities that serve as detected action instances.

Detailed Description

Bibliographic Details
Published in: IEEE transactions on circuits and systems for video technology 2022-07, Vol.32 (7), p.4473-4485
Main Authors: Li, Bairong, Liu, Ruixin, Chen, Tianquan, Zhu, Yuesheng
Format: Article
Language: eng
Subjects:
Online Access: Order full text
container_end_page 4485
container_issue 7
container_start_page 4473
container_title IEEE transactions on circuits and systems for video technology
container_volume 32
creator Li, Bairong
Liu, Ruixin
Chen, Tianquan
Zhu, Yuesheng
description Weakly supervised temporal action detection aims to localize the temporal positions of action instances in untrimmed videos using only action class labels. In general, previous methods individually classify each frame based on appearance information and short-term motion information, and then integrate consecutive high-response action frames into entities that serve as detected action instances. However, the long-range temporal dependencies between action frames are not fully utilized, and the detection results tend to be trapped in the most discriminative action segments. To alleviate this issue, we propose a novel two-branch detection framework (i.e., a coarse detection branch and a refining detection branch) that learns long-range temporal dependencies to obtain more accurate detection results, requiring only action class labels. The coarse detection branch localizes the most discriminative segments of action instances based on a typical multi-instance learning paradigm under the supervision of action class labels, whereas the refining detection branch is expected to localize the less discriminative segments of action instances by learning the long-range temporal dependencies between frames with the proposed Transformer-style architecture and learning strategies. This collaboration mechanism takes full advantage of the complementary information from the provided action class labels and the natural temporal dependencies between action frames, forming a more comprehensive solution. Consequently, our method obtains more precise detection results. As expected, the proposed method outperforms recent weakly supervised temporal action detection methods on the THUMOS14 and ActivityNet datasets, as measured by mAP@tIoU and AR@AN.
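
The description above outlines a two-branch architecture: a coarse branch trained with multi-instance learning (MIL) under video-level class labels, and a refining branch that models long-range temporal dependencies with a Transformer-style encoder. The following PyTorch sketch illustrates that general idea only; the module names, feature dimensions, top-k pooling, and loss pairing are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn

class CoarseBranch(nn.Module):
    """Per-snippet classifier trained with MIL: top-k snippet scores are
    pooled into a video-level prediction supervised by the class label."""
    def __init__(self, feat_dim=2048, num_classes=20, k=8):
        super().__init__()
        self.classifier = nn.Conv1d(feat_dim, num_classes, kernel_size=1)
        self.k = k

    def forward(self, feats):                    # feats: (B, D, T)
        cas = self.classifier(feats)             # class activation sequence (B, C, T)
        video_logits = cas.topk(self.k, dim=2).values.mean(dim=2)  # MIL pooling
        return cas, video_logits

class RefiningBranch(nn.Module):
    """Transformer encoder over snippet features, intended to capture the
    long-range temporal dependencies that recover less discriminative segments."""
    def __init__(self, feat_dim=2048, num_classes=20, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):                    # feats: (B, D, T)
        x = self.encoder(feats.transpose(1, 2))  # attend across all T snippets
        return self.head(x).transpose(1, 2)      # refined CAS: (B, C, T)

# Hypothetical usage: MIL loss on the coarse branch; one plausible learning
# strategy is to let the coarse CAS supervise the refining branch.
feats = torch.randn(2, 2048, 320)                # 2 videos, 320 snippets each
cas, video_logits = CoarseBranch()(feats)
refined_cas = RefiningBranch()(feats)
video_labels = torch.zeros(2, 20)
video_labels[:, 3] = 1.0                         # multi-hot action class labels
mil_loss = nn.BCEWithLogitsLoss()(video_logits, video_labels)

The two class activation sequences would then be fused or thresholded along the temporal axis to form detected action instances, mirroring the collaboration mechanism the abstract describes.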
doi_str_mv 10.1109/TCSVT.2021.3125701
format Article
fullrecord (raw XML record export omitted)
eissn 1558-2205
coden ITCTEM
publisher New York: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022
fulltext fulltext_linktorsrc
identifier ISSN: 1051-8215
ispartof IEEE transactions on circuits and systems for video technology, 2022-07, Vol.32 (7), p.4473-4485
issn 1051-8215
1558-2205
language eng
recordid cdi_ieee_primary_9605583
source IEEE Electronic Library Online
subjects Annotations
Frames
Labels
Learning
Proposals
Segments
Semantics
Task analysis
temporal action detection
Training
transformer
Transformers
Videos
Weakly supervised learning
title Weakly Supervised Temporal Action Detection With Temporal Dependency Learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T10%3A09%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Weakly%20Supervised%20Temporal%20Action%20Detection%20With%20Temporal%20Dependency%20Learning&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Li,%20Bairong&rft.date=2022-07-01&rft.volume=32&rft.issue=7&rft.spage=4473&rft.epage=4485&rft.pages=4473-4485&rft.issn=1051-8215&rft.eissn=1558-2205&rft.coden=ITCTEM&rft_id=info:doi/10.1109/TCSVT.2021.3125701&rft_dat=%3Cproquest_RIE%3E2682916437%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2682916437&rft_id=info:pmid/&rft_ieee_id=9605583&rfr_iscdi=true