Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024-12, p. 1-18
Main authors: Wang, Yulin; Zhang, Haoji; Yue, Yang; Song, Shiji; Deng, Chao; Feng, Junlan; Huang, Gao
Format: Article
Language: English
container_end_page 18
container_issue
container_start_page 1
container_title IEEE transactions on pattern analysis and machine intelligence
container_volume
creator Wang, Yulin
Zhang, Haoji
Yue, Yang
Song, Shiji
Deng, Chao
Feng, Junlan
Huang, Gao
description This paper presents a comprehensive exploration of the phenomenon of data redundancy in video understanding, with the aim of improving computational efficiency. Our investigation commences with an examination of spatial redundancy, which refers to the observation that the most informative region in each video frame usually corresponds to a small image patch, whose shape, size and location shift smoothly across frames. Motivated by this phenomenon, we formulate the patch localization problem as a dynamic decision task, and introduce a spatially adaptive video recognition approach, termed AdaFocus. Specifically, a lightweight encoder is first employed to quickly process the full video sequence, whose features are then utilized by a policy network to identify the most task-relevant regions. Subsequently, the selected patches are inferred by a high-capacity deep network for the final prediction. The complete model can be trained conveniently in an end-to-end manner. During inference, once the informative patch sequence has been generated, the bulk of computation can be executed in parallel, rendering it efficient on modern GPU devices. Furthermore, we demonstrate that AdaFocus can be easily extended by further considering the temporal and sample-wise redundancies, i.e., allocating the majority of computation to the most task-relevant video frames, and minimizing the computation spent on relatively "easier" videos. Our resulting algorithm, Uni-AdaFocus, establishes a comprehensive framework that seamlessly integrates spatial, temporal, and sample-wise dynamic computation, while preserving the merits of AdaFocus in terms of efficient end-to-end training and hardware friendliness. In addition, Uni-AdaFocus is general and flexible, as it is compatible with off-the-shelf backbone models (e.g., TSM and X3D), which can be readily deployed as our feature extractor, yielding significantly improved computational efficiency. Empirically, extensive experiments based on seven widely-used benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, Jester, and Kinetics-400) and three real-world application scenarios (i.e., fine-grained diving action classification, diagnosis of Alzheimer's and Parkinson's diseases with brain magnetic resonance images (MRI), and violence recognition for online videos) substantiate that Uni-AdaFocus is considerably more efficient than the competitive baselines. Code and pre-trained models are available at https://github.com/blackfeather-wang/AdaFocus and https://github.com/LeapLabTHU/AdaFocusV2 .
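The pipeline outlined in the description (a cheap global glance over downsampled frames, a policy network that picks one informative patch per frame, and a high-capacity network that only processes the selected patches) can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical illustration: the module sizes, the grid_sample-based differentiable crop, and the mean temporal pooling are assumptions made here for brevity, not the authors' implementation; the actual code is available at the repositories linked above.

```python
# Minimal sketch of a spatially adaptive "glance then focus" pipeline, assuming
# illustrative module sizes and a grid_sample-based differentiable crop. This is
# NOT the authors' AdaFocus code; see the GitHub repositories linked above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlanceAndFocus(nn.Module):
    def __init__(self, num_classes: int, patch_size: int = 96):
        super().__init__()
        self.patch_size = patch_size
        # Lightweight global encoder: a cheap pass over downsampled full frames.
        self.global_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Policy head: predicts a normalized (x, y) patch center for each frame.
        self.policy = nn.Sequential(
            nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2), nn.Sigmoid(),
        )
        # High-capacity local encoder: only ever sees the selected small patches.
        self.local_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(128, num_classes)

    def crop(self, frames: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        """Differentiably cut a patch around each predicted center via grid_sample."""
        n, _, h, w = frames.shape
        p = self.patch_size
        ys = torch.linspace(-p / h, p / h, p, device=frames.device)
        xs = torch.linspace(-p / w, p / w, p, device=frames.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        grid = grid + (centers * 2.0 - 1.0).view(n, 1, 1, 2)  # shift window to center
        return F.grid_sample(frames, grid, align_corners=False)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = video.shape                        # (batch, time, C, H, W)
        frames = video.flatten(0, 1)                       # fold time into the batch
        coarse = F.interpolate(frames, size=(96, 96), mode="bilinear", align_corners=False)
        centers = self.policy(self.global_encoder(coarse))  # one (x, y) per frame
        patches = self.crop(frames, centers)               # only small patches go on
        feats = self.local_encoder(patches).view(b, t, -1)
        return self.classifier(feats.mean(dim=1))          # average over frames


if __name__ == "__main__":
    model = GlanceAndFocus(num_classes=200)
    logits = model(torch.randn(2, 8, 3, 224, 224))         # 2 clips of 8 frames each
    print(logits.shape)                                     # torch.Size([2, 200])
```

Because the cheap global pass produces all patch locations up front, the expensive local-encoder calls can be batched across frames at inference time, which is what the description means by executing the bulk of computation in parallel on modern GPUs.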
doi_str_mv 10.1109/TPAMI.2024.3514654
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 0162-8828
ispartof IEEE transactions on pattern analysis and machine intelligence, 2024-12, p.1-18
issn 0162-8828
2160-9292
language eng
recordid cdi_crossref_primary_10_1109_TPAMI_2024_3514654
source IEEE/IET Electronic Library (IEL)
subjects Accuracy
Computational efficiency
Computational modeling
Dynamic neural networks
efficient deep learning
Feature extraction
Heuristic algorithms
Image recognition
Redundancy
Termination of employment
Training
video recognition
X3D
title Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition
url https://doi.org/10.1109/TPAMI.2024.3514654