Spatial–temporal injection network: exploiting auxiliary losses for action recognition with apparent difference and self-attention
Two-stream convolutional networks have shown strong performance in action recognition. However, the spatial and temporal features of the two streams are learned separately, and little consideration has been given to the distinct characteristics of each stream: both are processed with the same operations. In this paper, we build upon two-stream convolutional networks and propose a novel spatial–temporal injection network (STIN) with two different auxiliary losses. To build spatial–temporal features as the video representation, an apparent difference module is designed to impose auxiliary temporal constraints on the spatial features of the spatial injection stream. A self-attention mechanism attends to regions of interest in the temporal injection stream, reducing the influence of optical-flow noise from irrelevant regions. These auxiliary losses enable efficient training of two complementary streams that capture interactions between spatial and temporal information from different perspectives. Experiments on two well-known datasets, UCF101 and HMDB51, demonstrate the effectiveness of the proposed STIN.
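The abstract describes the two auxiliary modules only at a high level, and the paper's implementation is not reproduced in this record. The following is a minimal PyTorch-style sketch of how such modules could look; the module names, tensor shapes, and auxiliary-loss formulation are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of the two auxiliary modules described in the abstract.
# All names, shapes, and the loss formulation are assumed, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ApparentDifferenceModule(nn.Module):
    """Sketch: constrain spatial features with an apparent (frame-to-frame)
    difference signal, producing an auxiliary loss for the spatial stream."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        # The difference of consecutive-frame spatial features approximates motion.
        diff = feat_t1 - feat_t
        # Toy auxiliary loss: the projected difference should match the raw
        # motion signal, nudging spatial features to encode temporal change.
        return F.mse_loss(self.proj(diff), diff.detach())


class FlowSelfAttention(nn.Module):
    """Sketch: SAGAN-style self-attention over optical-flow features, so the
    temporal stream attends to regions of interest and suppresses flow noise."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (B, HW, C//8)
        k = self.key(x).flatten(2)                    # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)           # (B, HW, HW)
        v = self.value(x).flatten(2)                  # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out                   # residual connection


if __name__ == "__main__":
    feat_t, feat_t1 = torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14)
    aux_loss = ApparentDifferenceModule(64)(feat_t, feat_t1)  # scalar auxiliary loss
    attended = FlowSelfAttention(64)(torch.randn(2, 64, 14, 14))
    print(aux_loss.item(), attended.shape)
```

In a training loop following the abstract's description, such auxiliary losses would be added, with some weighting, to the main classification loss of each stream so that the two complementary streams are trained jointly.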
Saved in:
Published in: Signal, Image and Video Processing, 2023-06, Vol. 17 (4), pp. 1173-1180
Main authors: Cao, Haiwen; Wu, Chunlei; Lu, Jing; Wu, Jie; Wang, Leiquan
Format: Article
Language: English
Subjects: Activity recognition; Computer Imaging; Computer Science; Constraint modelling; Image Processing and Computer Vision; Multimedia Information Systems; Optical flow (image analysis); Original Paper; Pattern Recognition and Graphics; Signal, Image and Speech Processing; Streams; Vision
Online access: Full text
DOI: 10.1007/s11760-022-02324-x
ISSN: 1863-1703 (print); 1863-1711 (electronic)
Publisher: Springer London
Source: SpringerNature Journals