B2C-AFM: Bi-directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Human Action Recognition is a driving engine of many human-computer interaction applications. Most current research focuses on improving model generalization by integrating multiple homogeneous modalities, including RGB images, human poses, and optical flows. Furthermore, contextual interactions and out-of-context sign languages have been shown to depend on the scene category and on the human per se. Attempts to integrate appearance features with human poses have produced positive results. However, because human poses carry spatial errors and temporal ambiguities, existing methods suffer from poor scalability, limited robustness, and sub-optimal models. In this paper, inspired by the assumption that different modalities may maintain temporal consistency and spatial complementarity, we present a novel Bi-directional Co-temporal and Cross-spatial Attention Fusion Model (B2C-AFM). The model is characterized by an asynchronous fusion strategy that combines multi-modal features along the temporal and spatial dimensions. In addition, a novel explicit motion-oriented pose representation called Limb Flow Fields (Lff) is explored to alleviate the temporal ambiguity of human poses. Experiments on publicly available datasets validate our contributions, and extensive ablation studies show that B2C-AFM achieves robust performance across both seen and unseen human actions. The code is available online (see footnote 1 of the paper).
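
The record itself contains no code, but the bi-directional cross-modal fusion the abstract describes can be sketched. Below is a minimal PyTorch illustration with two attention streams in which RGB features query pose features and vice versa; the module name, dimensions, and residual design are assumptions made for illustration, not the authors' published implementation.

```python
# Hedged sketch: bi-directional cross-modal attention fusion.
# All names, shapes, and design choices are illustrative assumptions.
import torch
import torch.nn as nn

class BiDirectionalAttentionFusion(nn.Module):
    """Fuses an RGB stream and a pose stream in both directions:
    RGB attends to pose and pose attends to RGB."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.pose_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rgb_to_pose = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_pose = nn.LayerNorm(dim)

    def forward(self, rgb_feats, pose_feats):
        # rgb_feats, pose_feats: (batch, time, dim) token sequences.
        rgb_att, _ = self.pose_to_rgb(rgb_feats, pose_feats, pose_feats)
        pose_att, _ = self.rgb_to_pose(pose_feats, rgb_feats, rgb_feats)
        # Residual connection plus normalization on each stream.
        return (self.norm_rgb(rgb_feats + rgb_att),
                self.norm_pose(pose_feats + pose_att))

# Usage: two clips, 16 time steps, 256-dimensional features per stream.
fusion = BiDirectionalAttentionFusion()
rgb, pose = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
rgb_fused, pose_fused = fusion(rgb, pose)
```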

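The abstract describes Limb Flow Fields (Lff) only as an explicit motion-oriented pose representation. As a rough approximation of that idea, the sketch below derives per-limb motion vectors from frame-to-frame joint displacements; the function names and limb index pairs are hypothetical, and the paper's actual Lff construction may differ.

```python
# Hedged sketch: a motion-oriented pose representation in the spirit
# of Limb Flow Fields. Not the paper's actual construction.
import numpy as np

def joint_flow(keypoints: np.ndarray) -> np.ndarray:
    """keypoints: (T, J, 2) array of J 2-D joints over T frames.
    Returns (T-1, J, 2) frame-to-frame displacement vectors."""
    return keypoints[1:] - keypoints[:-1]

def limb_flow(keypoints: np.ndarray, limbs: list[tuple[int, int]]) -> np.ndarray:
    """Average the flow of each limb's two endpoint joints, giving one
    motion vector per limb per frame transition: (T-1, L, 2)."""
    flow = joint_flow(keypoints)
    return np.stack([(flow[:, a] + flow[:, b]) / 2.0 for a, b in limbs], axis=1)

# Usage with a toy 3-joint chain (shoulder, elbow, wrist):
kps = np.random.rand(16, 3, 2)  # 16 frames, 3 joints, 2-D coordinates
print(limb_flow(kps, [(0, 1), (1, 2)]).shape)  # (15, 2, 2)
```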

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2023-01, Vol. 32, p. 1-1
Main Authors: Guo, Fangtai; Jin, Tianlei; Zhu, Shiqiang; Xi, Xiangming; Wang, Wen; Meng, Qiwei; Song, Wei; Zhu, Jiakai
Format: Article
Language: English
Subjects: Ablation; B2C-AFM; Color imagery; fusion model; homogeneous modalities; Human action recognition; Human activity recognition; limb flow fields; Optical flow (image analysis)
ISSN: 1057-7149
EISSN: 1941-0042
DOI: 10.1109/TIP.2023.3308750