Spatial–temporal injection network: exploiting auxiliary losses for action recognition with apparent difference and self-attention

Two-stream convolutional networks have shown strong performance in action recognition. However, the spatial and temporal features of the two streams are learned separately, and both streams undergo the same operations with almost no regard for their different characteristics. In this paper, we build upon two-stream convolutional networks and propose a novel spatial–temporal injection network (STIN) with two different auxiliary losses. To build spatial–temporal features as the video representation, an apparent difference module is designed to impose auxiliary temporal constraints on the spatial features in the spatial injection network. A self-attention mechanism attends to the regions of interest in the temporal injection stream, which reduces the influence of optical-flow noise from uninteresting regions. These auxiliary losses enable efficient training of two complementary streams that capture interactions between spatial and temporal information from different perspectives. Experiments on two well-known datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed STIN.
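
The abstract describes two mechanisms at a high level: an apparent difference module that ties spatial features to temporal change through an auxiliary loss, and self-attention that re-weights flow features toward regions of interest. The following is a minimal PyTorch sketch of those two ideas only; the class names, tensor shapes, single-head attention form, and MSE loss are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the two auxiliary mechanisms named in the abstract.
# All names and design choices here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ApparentDifferenceModule(nn.Module):
    """Predicts the feature-level change between consecutive frames from the
    spatial stream, yielding an auxiliary temporal-constraint loss."""

    def __init__(self, channels: int):
        super().__init__()
        self.predict = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        # feat_t, feat_t1: spatial features of frames t and t+1, shape (B, C, H, W).
        pred_diff = self.predict(feat_t)           # predicted apparent change
        target_diff = (feat_t1 - feat_t).detach()  # observed change, no gradient
        return F.mse_loss(pred_diff, target_diff)  # auxiliary loss term


class TemporalSelfAttention(nn.Module):
    """Single-head spatial self-attention over optical-flow features, letting
    the temporal stream down-weight noisy, uninteresting regions.
    Assumes channels >= 8."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C//8)
        k = self.key(x).flatten(2)                    # (B, C//8, HW)
        v = self.value(x).flatten(2)                  # (B, C, HW)
        # Attention over all spatial positions, scaled dot-product.
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + out  # residual connection keeps the original flow features

In a full pipeline, these terms would presumably be added to the classification loss with weighting coefficients (e.g. total = cls_loss + lambda_d * diff_loss); the weights and the backbone layers where the modules attach are not specified in the abstract.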

Bibliographic details
Published in: Signal, Image and Video Processing, 2023-06, Vol. 17 (4), p. 1173-1180
Authors: Cao, Haiwen; Wu, Chunlei; Lu, Jing; Wu, Jie; Wang, Leiquan
Format: Article
Language: English
Subjects: Activity recognition; Computer Imaging; Computer Science; Constraint modelling; Image Processing and Computer Vision; Multimedia Information Systems; Optical flow (image analysis); Original Paper; Pattern Recognition and Graphics; Signal, Image and Speech Processing; Streams; Vision
Online access: Full text
DOI: 10.1007/s11760-022-02324-x
Publisher: Springer London
ISSN: 1863-1703
EISSN: 1863-1711