Spatial–temporal injection network: exploiting auxiliary losses for action recognition with apparent difference and self-attention

Two-stream convolutional networks have shown strong performance in action recognition. However, the spatial and temporal features of the two streams are learned separately, and both streams undergo the same operations with almost no regard for their different characteristics. In this paper, we build upon two-stream convolutional networks and propose a novel spatial–temporal injection network (STIN) with two different auxiliary losses. To build spatial–temporal features as the video representation, an apparent difference module is designed to impose auxiliary temporal constraints on the spatial features in the spatial injection network. A self-attention mechanism attends to the regions of interest in the temporal injection stream, which reduces the influence of optical-flow noise from uninteresting regions. These auxiliary losses enable efficient training of two complementary streams that capture interactions between spatial and temporal information from different perspectives. Experiments on two well-known datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed STIN.
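
The abstract describes two mechanisms at a high level: an apparent difference module that ties spatial features to temporal change through an auxiliary loss, and self-attention that re-weights flow features toward regions of interest. The following is a minimal PyTorch sketch of those two ideas only; the class names, tensor shapes, single-head attention form, and MSE loss are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the two auxiliary mechanisms named in the abstract.
# All names and design choices here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ApparentDifferenceModule(nn.Module):
    """Predicts the feature-level change between consecutive frames from the
    spatial stream, yielding an auxiliary temporal-constraint loss."""

    def __init__(self, channels: int):
        super().__init__()
        self.predict = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        # feat_t, feat_t1: spatial features of frames t and t+1, shape (B, C, H, W).
        pred_diff = self.predict(feat_t)           # predicted apparent change
        target_diff = (feat_t1 - feat_t).detach()  # observed change, no gradient
        return F.mse_loss(pred_diff, target_diff)  # auxiliary loss term


class TemporalSelfAttention(nn.Module):
    """Single-head spatial self-attention over optical-flow features, letting
    the temporal stream down-weight noisy, uninteresting regions.
    Assumes channels >= 8."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C//8)
        k = self.key(x).flatten(2)                    # (B, C//8, HW)
        v = self.value(x).flatten(2)                  # (B, C, HW)
        # Attention over all spatial positions, scaled dot-product.
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + out  # residual connection keeps the original flow features

In a full pipeline, these terms would presumably be added to the classification loss with weighting coefficients (e.g. total = cls_loss + lambda_d * diff_loss); the weights and the backbone layers where the modules attach are not specified in the abstract.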

Bibliographic details
Published in: Signal, Image and Video Processing, 2023-06, Vol. 17 (4), p. 1173-1180
Authors: Cao, Haiwen; Wu, Chunlei; Lu, Jing; Wu, Jie; Wang, Leiquan
Format: Article
Language: English
Subjects: Activity recognition; Computer Imaging; Computer Science; Constraint modelling; Image Processing and Computer Vision; Multimedia Information Systems; Optical flow (image analysis); Original Paper; Pattern Recognition and Graphics; Signal, Image and Speech Processing; Streams; Vision
Online access: Full text
DOI: 10.1007/s11760-022-02324-x
Publisher: Springer London
ISSN: 1863-1703
EISSN: 1863-1711