Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024-12, p. 1-18
Main authors: Wang, Yulin; Zhang, Haoji; Yue, Yang; Song, Shiji; Deng, Chao; Feng, Junlan; Huang, Gao
Format: Article
Language: English
container_end_page 18
container_issue
container_start_page 1
container_title IEEE transactions on pattern analysis and machine intelligence
container_volume
creator Wang, Yulin
Zhang, Haoji
Yue, Yang
Song, Shiji
Deng, Chao
Feng, Junlan
Huang, Gao
description This paper presents a comprehensive exploration of the phenomenon of data redundancy in video understanding, with the aim of improving computational efficiency. Our investigation commences with an examination of spatial redundancy, which refers to the observation that the most informative region in each video frame usually corresponds to a small image patch, whose shape, size and location shift smoothly across frames. Motivated by this phenomenon, we formulate the patch localization problem as a dynamic decision task, and introduce a spatially adaptive video recognition approach, termed AdaFocus. Specifically, a lightweight encoder is first employed to quickly process the full video sequence, whose features are then utilized by a policy network to identify the most task-relevant regions. Subsequently, the selected patches are inferred by a high-capacity deep network for the final prediction. The complete model can be trained conveniently in an end-to-end manner. During inference, once the informative patch sequence has been generated, the bulk of computation can be executed in parallel, rendering it efficient on modern GPU devices. Furthermore, we demonstrate that AdaFocus can be easily extended by further considering the temporal and sample-wise redundancies, i.e., allocating the majority of computation to the most task-relevant video frames, and minimizing the computation spent on relatively "easier" videos. Our resulting algorithm, Uni-AdaFocus, establishes a comprehensive framework that seamlessly integrates spatial, temporal, and sample-wise dynamic computation, while preserving the merits of AdaFocus in terms of efficient end-to-end training and hardware friendliness. In addition, Uni-AdaFocus is general and flexible, as it is compatible with off-the-shelf backbone models (e.g., TSM and X3D), which can be readily deployed as our feature extractor, yielding significantly improved computational efficiency. Empirically, extensive experiments based on seven widely-used benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, Jester, and Kinetics-400) and three real-world application scenarios (i.e., fine-grained diving action classification, diagnosis of Alzheimer's and Parkinson's diseases with brain magnetic resonance images (MRI), and violence recognition for online videos) substantiate that Uni-AdaFocus is considerably more efficient than the competitive baselines. Code and pre-trained models are available at https://github.com/blackfeather-wang/AdaFocus and https://github.com/LeapLabTHU/AdaFocusV2 .
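The pipeline outlined in the description (a cheap global glance over downsampled frames, a policy network that picks one informative patch per frame, and a high-capacity network that only processes the selected patches) can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical illustration: the module sizes, the grid_sample-based differentiable crop, and the mean temporal pooling are assumptions made here for brevity, not the authors' implementation; the actual code is available at the repositories linked above.

```python
# Minimal sketch of a spatially adaptive "glance then focus" pipeline, assuming
# illustrative module sizes and a grid_sample-based differentiable crop. This is
# NOT the authors' AdaFocus code; see the GitHub repositories linked above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlanceAndFocus(nn.Module):
    def __init__(self, num_classes: int, patch_size: int = 96):
        super().__init__()
        self.patch_size = patch_size
        # Lightweight global encoder: a cheap pass over downsampled full frames.
        self.global_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Policy head: predicts a normalized (x, y) patch center for each frame.
        self.policy = nn.Sequential(
            nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2), nn.Sigmoid(),
        )
        # High-capacity local encoder: only ever sees the selected small patches.
        self.local_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(128, num_classes)

    def crop(self, frames: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        """Differentiably cut a patch around each predicted center via grid_sample."""
        n, _, h, w = frames.shape
        p = self.patch_size
        ys = torch.linspace(-p / h, p / h, p, device=frames.device)
        xs = torch.linspace(-p / w, p / w, p, device=frames.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        grid = grid + (centers * 2.0 - 1.0).view(n, 1, 1, 2)  # shift window to center
        return F.grid_sample(frames, grid, align_corners=False)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = video.shape                        # (batch, time, C, H, W)
        frames = video.flatten(0, 1)                       # fold time into the batch
        coarse = F.interpolate(frames, size=(96, 96), mode="bilinear", align_corners=False)
        centers = self.policy(self.global_encoder(coarse))  # one (x, y) per frame
        patches = self.crop(frames, centers)               # only small patches go on
        feats = self.local_encoder(patches).view(b, t, -1)
        return self.classifier(feats.mean(dim=1))          # average over frames


if __name__ == "__main__":
    model = GlanceAndFocus(num_classes=200)
    logits = model(torch.randn(2, 8, 3, 224, 224))         # 2 clips of 8 frames each
    print(logits.shape)                                     # torch.Size([2, 200])
```

Because the cheap global pass produces all patch locations up front, the expensive local-encoder calls can be batched across frames at inference time, which is what the description means by executing the bulk of computation in parallel on modern GPUs.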
doi_str_mv 10.1109/TPAMI.2024.3514654
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 0162-8828
ispartof IEEE transactions on pattern analysis and machine intelligence, 2024-12, p.1-18
issn 0162-8828
2160-9292
language eng
recordid cdi_crossref_primary_10_1109_TPAMI_2024_3514654
source IEEE/IET Electronic Library (IEL)
subjects Accuracy
Computational efficiency
Computational modeling
Dynamic neural networks
efficient deep learning
Feature extraction
Heuristic algorithms
Image recognition
Redundancy
Termination of employment
Training
video recognition
X3D
title Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition
url https://doi.org/10.1109/TPAMI.2024.3514654