Deep Motion Prior for Weakly-Supervised Temporal Action Localization
Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels. Currently, most state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline: producing snippet-level predictions first and then aggregating them to a video-level prediction.
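The abstract (given in full in the description field below) describes the standard MIL pipeline of aggregating snippet-level scores into a video-level prediction, plus a motion-guided loss modulated by per-snippet motionness scores derived from optical flow. A minimal PyTorch sketch of these ideas follows; it is an illustration only, not the authors' implementation, and every name, shape, and design choice (the top-k pooling ratio, the affinity-graph construction, the re-weighting scheme) is an assumption made for exposition.

```python
# Purely illustrative sketch of an MIL-style WSTAL pipeline with a
# motionness-weighted loss, in the spirit of the abstract. All names,
# shapes, and design choices are assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def motionness_from_flow(flow_feats: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Derive a per-snippet motionness score (T,) from optical-flow features (T, D)
    via a simple snippet-affinity graph: each snippet aggregates the flow magnitude
    of the snippets it is most similar to, giving a context-dependent score."""
    feats = F.normalize(flow_feats, dim=1)                             # (T, D), unit norm
    affinity = torch.softmax(feats @ feats.t() / temperature, dim=1)   # (T, T) row-normalised graph
    magnitude = flow_feats.norm(dim=1)                                 # crude per-snippet motion strength
    motionness = affinity @ magnitude                                  # propagate magnitudes over the graph
    return (motionness - motionness.min()) / (motionness.max() - motionness.min() + 1e-6)


def video_level_logits(snippet_logits: torch.Tensor, topk_ratio: float = 0.125) -> torch.Tensor:
    """Standard MIL pooling: average the top-k snippet logits per class, (T, C) -> (C,)."""
    k = max(1, int(snippet_logits.shape[0] * topk_ratio))
    topk, _ = torch.topk(snippet_logits, k, dim=0)
    return topk.mean(dim=0)


def motion_guided_loss(snippet_logits: torch.Tensor,
                       motionness: torch.Tensor,
                       video_label: torch.Tensor) -> torch.Tensor:
    """Video-level multi-label loss in which snippets are re-weighted by motionness
    before MIL pooling, so high-motion snippets dominate the aggregated prediction."""
    weighted = snippet_logits * motionness.unsqueeze(1)                # (T, C)
    return F.binary_cross_entropy_with_logits(video_level_logits(weighted), video_label)


if __name__ == "__main__":
    T, C, D = 64, 20, 1024                 # snippets, action classes, flow-feature dim
    flow_feats = torch.randn(T, D)
    snippet_logits = torch.randn(T, C, requires_grad=True)
    video_label = torch.zeros(C)
    video_label[3] = 1.0                   # only a video-level label is available
    loss = motion_guided_loss(snippet_logits, motionness_from_flow(flow_feats), video_label)
    loss.backward()
    print(float(loss))
```

In a real WSTAL system the snippet logits would come from a temporal classification head over RGB and flow features, and the motionness scores would be learned jointly rather than computed from raw feature norms as above.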
Saved in:
Published in: | IEEE transactions on image processing 2022, Vol.31, p.1-1 |
---|---|
Main Authors: | Cao, Meng; Zhang, Can; Chen, Long; Shou, Mike Zheng; Zou, Yuexian |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 1 |
---|---|
container_issue | |
container_start_page | 1 |
container_title | IEEE transactions on image processing |
container_volume | 31 |
creator | Cao, Meng; Zhang, Can; Chen, Long; Shou, Mike Zheng; Zou, Yuexian |
description | Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels. Currently, most state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline: producing snippet-level predictions first and then aggregating them to a video-level prediction. However, we argue that existing methods have overlooked two important drawbacks: 1) inadequate use of motion information and 2) the incompatibility of the prevailing cross-entropy training loss. In this paper, we analyze the motion cues behind optical flow features and find that they provide complementary information. Inspired by this, we propose to build a context-dependent motion prior, termed motionness. Specifically, a motion graph is introduced to model motionness based on a local motion carrier (e.g., optical flow). In addition, to highlight more informative video snippets, a motion-guided loss is proposed to modulate the network training conditioned on motionness scores. Extensive ablation studies confirm that motionness effectively models actions of interest, and the motion-guided loss leads to more accurate results. Moreover, our motion-guided loss is a plug-and-play loss function applicable to existing WSTAL methods. Without loss of generality, based on the standard MIL pipeline, our method achieves new state-of-the-art performance on three challenging benchmarks: THUMOS'14, ActivityNet v1.2, and v1.3. |
doi_str_mv | 10.1109/TIP.2022.3193752 |
format | Article |
rights | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022 |
publisher | New York: IEEE |
eissn | 1941-0042 |
coden | IIPRE4 |
ieee_id | 9846868 |
orcid | 0000-0002-8946-4228; 0000-0001-9530-5218; 0000-0001-6148-9709 |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1057-7149 |
ispartof | IEEE transactions on image processing, 2022, Vol.31, p.1-1 |
issn | 1057-7149; 1941-0042 |
language | eng |
recordid | cdi_crossref_primary_10_1109_TIP_2022_3193752 |
source | IEEE Electronic Library (IEL) |
subjects | Ablation; Adaptive optics; Deep Motion Prior; Feature extraction; Incompatibility; Localization; Location awareness; Motion-guided Loss; Optical flow (image analysis); Optical imaging; Optical losses; Training; Videos; Weakly-Supervised Temporal Action Localization (WSTAL); Xenon |
title | Deep Motion Prior for Weakly-Supervised Temporal Action Localization |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T20%3A02%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Deep%20Motion%20Prior%20for%20Weakly-Supervised%20Temporal%20Action%20Localization&rft.jtitle=IEEE%20transactions%20on%20image%20processing&rft.au=Cao,%20Meng&rft.date=2022&rft.volume=31&rft.spage=1&rft.epage=1&rft.pages=1-1&rft.issn=1057-7149&rft.eissn=1941-0042&rft.coden=IIPRE4&rft_id=info:doi/10.1109/TIP.2022.3193752&rft_dat=%3Cproquest_RIE%3E2698813901%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2698813901&rft_id=info:pmid/&rft_ieee_id=9846868&rfr_iscdi=true |