Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition
Published in: | IEEE transactions on pattern analysis and machine intelligence 2024-12, p.1-18 |
---|---|
Main authors: | Wang, Yulin; Zhang, Haoji; Yue, Yang; Song, Shiji; Deng, Chao; Feng, Junlan; Huang, Gao |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 18 |
---|---|
container_issue | |
container_start_page | 1 |
container_title | IEEE transactions on pattern analysis and machine intelligence |
container_volume | |
creator | Wang, Yulin ; Zhang, Haoji ; Yue, Yang ; Song, Shiji ; Deng, Chao ; Feng, Junlan ; Huang, Gao |
description | This paper presents a comprehensive exploration of the phenomenon of data redundancy in video understanding, with the aim of improving computational efficiency. Our investigation commences with an examination of spatial redundancy, which refers to the observation that the most informative region in each video frame usually corresponds to a small image patch, whose shape, size and location shift smoothly across frames. Motivated by this phenomenon, we formulate the patch localization problem as a dynamic decision task, and introduce a spatially adaptive video recognition approach, termed AdaFocus. Specifically, a lightweight encoder is first employed to quickly process the full video sequence, whose features are then utilized by a policy network to identify the most task-relevant regions. Subsequently, the selected patches are inferred by a high-capacity deep network for the final prediction. The complete model can be trained conveniently in an end-to-end manner. During inference, once the informative patch sequence has been generated, the bulk of computation can be executed in parallel, rendering it efficient on modern GPU devices. Furthermore, we demonstrate that AdaFocus can be easily extended by further considering the temporal and sample-wise redundancies, i.e., allocating the majority of computation to the most task-relevant video frames, and minimizing the computation spent on relatively "easier" videos. Our resulting algorithm, Uni-AdaFocus, establishes a comprehensive framework that seamlessly integrates spatial, temporal, and sample-wise dynamic computation, while preserving the merits of AdaFocus in terms of efficient end-to-end training and hardware friendliness. In addition, Uni-AdaFocus is general and flexible, as it is compatible with off-the-shelf backbone models (e.g., TSM and X3D), which can be readily deployed as our feature extractor, yielding significantly improved computational efficiency. Empirically, extensive experiments based on seven widely used benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, Jester, and Kinetics-400) and three real-world application scenarios (i.e., fine-grained diving action classification, Alzheimer's and Parkinson's disease diagnosis with brain magnetic resonance images (MRI), and violence recognition for online videos) substantiate that Uni-AdaFocus is considerably more efficient than the competitive baselines. Code and pre-trained models are available at https://github.com/blackfeather-wang/AdaFocus and https://github.com/LeapLabTHU/AdaFocusV2. |
doi_str_mv | 10.1109/TPAMI.2024.3514654 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 0162-8828 |
ispartof | IEEE transactions on pattern analysis and machine intelligence, 2024-12, p.1-18 |
issn | 0162-8828 2160-9292 |
language | eng |
recordid | cdi_crossref_primary_10_1109_TPAMI_2024_3514654 |
source | IEEE/IET Electronic Library (IEL) |
subjects | Accuracy ; Computational efficiency ; Computational modeling ; Dynamic neural networks ; efficient deep learning ; Feature extraction ; Heuristic algorithms ; Image recognition ; Redundancy ; Termination of employment ; Training ; video recognition ; X3D |
title | Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-04T14%3A42%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Uni-AdaFocus:%20Spatial-Temporal%20Dynamic%20Computation%20for%20Video%20Recognition&rft.jtitle=IEEE%20transactions%20on%20pattern%20analysis%20and%20machine%20intelligence&rft.au=Wang,%20Yulin&rft.date=2024-12-09&rft.spage=1&rft.epage=18&rft.pages=1-18&rft.issn=0162-8828&rft.eissn=2160-9292&rft.coden=ITPIDJ&rft_id=info:doi/10.1109/TPAMI.2024.3514654&rft_dat=%3Ccrossref_RIE%3E10_1109_TPAMI_2024_3514654%3C/crossref_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10787270&rfr_iscdi=true |
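The description field above outlines a two-stage pipeline: a lightweight global encoder scans the whole (downsampled) video, a policy network uses those features to pick one informative patch per frame, and a high-capacity network classifies only the selected patches, with everything trained end-to-end. The following is a minimal conceptual sketch of that idea in PyTorch; it is not the authors' released implementation (see the repositories linked in the description), and the module sizes, the 96-pixel patch, the grid_sample-based cropping, and all names (AdaFocusSketch, crop_patches) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaFocusSketch(nn.Module):
    """Toy two-stage pipeline: cheap global scan -> patch policy -> costly local net."""

    def __init__(self, num_classes: int = 200, patch_size: int = 96):
        super().__init__()
        self.patch_size = patch_size
        # Lightweight global encoder: quickly processes every (downsampled) frame.
        self.global_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Policy head: predicts a normalized (x, y) patch centre per frame.
        self.policy = nn.Linear(64, 2)
        # High-capacity local network: only ever sees the selected small patches.
        self.local_net = nn.Sequential(
            nn.Conv2d(3, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(256, num_classes)

    def crop_patches(self, frames: torch.Tensor, centres: torch.Tensor) -> torch.Tensor:
        # Interpolation-based crop (grid_sample), so gradients can reach the policy head.
        n, _, h, w = frames.shape
        p = self.patch_size
        theta = frames.new_zeros(n, 2, 3)
        theta[:, 0, 0] = p / w                      # patch width relative to frame width
        theta[:, 1, 1] = p / h                      # patch height relative to frame height
        theta[:, 0, 2] = centres[:, 0] * 2 - 1      # x centre mapped from [0, 1] to [-1, 1]
        theta[:, 1, 2] = centres[:, 1] * 2 - 1      # y centre mapped from [0, 1] to [-1, 1]
        grid = F.affine_grid(theta, (n, 3, p, p), align_corners=False)
        return F.grid_sample(frames, grid, align_corners=False)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, t, c, h, w = video.shape
        frames = video.flatten(0, 1)
        # Stage 1: coarse features from a low-resolution view of each frame.
        coarse = self.global_encoder(F.interpolate(frames, size=(96, 96)))
        # Stage 2: the policy picks one patch centre per frame.
        centres = torch.sigmoid(self.policy(coarse))
        # Stage 3: the expensive network runs only on the selected patches;
        # all patches are processed as one parallel batch.
        feats = self.local_net(self.crop_patches(frames, centres))
        logits = self.classifier(feats.view(b, t, -1).mean(dim=1))
        return logits


if __name__ == "__main__":
    model = AdaFocusSketch(num_classes=10)
    clip = torch.randn(2, 8, 3, 224, 224)           # 2 videos, 8 frames each
    print(model(clip).shape)                         # torch.Size([2, 10])
```

The interpolation-based crop keeps patch selection differentiable, which mirrors the end-to-end trainability emphasized in the abstract; the actual Uni-AdaFocus additionally allocates computation across frames (temporal) and across videos (sample-wise), which this sketch omits.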