Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition
Saved in:
Published in: | Neural computing & applications, 2020-07, Vol. 32 (14), p. 10423-10434 |
---|---|
Main authors: | Khowaja, Sunder Ali ; Lee, Seok-Lyong |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
container_end_page | 10434 |
---|---|
container_issue | 14 |
container_start_page | 10423 |
container_title | Neural computing & applications |
container_volume | 32 |
creator | Khowaja, Sunder Ali ; Lee, Seok-Lyong |
description | Two-stream networks have provided an alternative way of exploiting spatiotemporal information for the action recognition problem. Nevertheless, most two-stream variants fuse homogeneous modalities, which cannot efficiently capture the action-motion dynamics in videos. Moreover, existing studies cannot extend the streams beyond the number of modalities. To address these limitations, we propose hybrid and hierarchical fusion (HHF) networks. The hybrid fusion handles non-homogeneous modalities and introduces a cross-modal learning stream for effective modeling of motion dynamics, extending existing two-stream variants to three and six streams. The hierarchical fusion, in turn, makes the modalities consistent by modeling long-term temporal information and combining multiple streams to improve recognition performance. The proposed network architecture comprises three fusion tiers: the hybrid fusion itself; the long-term fusion pooling layer, which models long-term dynamics from the RGB and optical flow modalities; and the adaptive weighting scheme for combining the classification scores from several streams. We show that the hybrid fusion yields representations distinct from the base modalities for training the cross-modal learning stream. We have conducted extensive experiments and shown that the proposed six-stream HHF network outperforms existing two- and four-stream networks, achieving state-of-the-art recognition performance: 97.2% and 76.7% accuracy on the UCF101 and HMDB51 datasets, respectively, which are widely used in action recognition studies. |
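The final fusion tier described in the abstract combines per-stream classification scores with adaptive weights. The abstract does not specify the weighting rule, so the sketch below is a minimal illustrative assumption: each stream's softmax scores are averaged with weights proportional to a per-stream reliability value (e.g. held-out validation accuracy). All names and the example values are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one stream's class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_scores(stream_scores, stream_weights):
    # Weighted average of per-stream class probabilities.
    # stream_weights are normalized so the fused scores sum to 1.
    total_w = sum(stream_weights)
    weights = [w / total_w for w in stream_weights]
    n_classes = len(stream_scores[0])
    fused = [0.0] * n_classes
    for logits, w in zip(stream_scores, weights):
        for c, p in enumerate(softmax(logits)):
            fused[c] += w * p
    return fused

# Hypothetical example: six streams, three action classes.
# The first three streams favor class 0, the last three class 1;
# the weights might come from each stream's validation accuracy.
scores = [[2.0, 0.5, 0.1]] * 3 + [[0.2, 1.8, 0.3]] * 3
weights = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
fused = fuse_scores(scores, weights)
prediction = max(range(len(fused)), key=fused.__getitem__)
```

Because the earlier (more reliable) streams carry more total weight, the fused prediction follows their vote even though half the streams disagree; this is the intuition behind weighting streams adaptively rather than averaging them uniformly.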
doi_str_mv | 10.1007/s00521-019-04578-y |
format | Article |
fulltext | fulltext |
identifier | ISSN: 0941-0643 |
ispartof | Neural computing & applications, 2020-07, Vol.32 (14), p.10423-10434 |
issn | 0941-0643 1433-3058 |
language | eng |
recordid | cdi_proquest_journals_2418452440 |
source | SpringerLink Journals - AutoHoldings |
subjects | Artificial Intelligence ; Computational Biology/Bioinformatics ; Computational Science and Engineering ; Computer architecture ; Computer Science ; Data Mining and Knowledge Discovery ; Image Processing and Computer Vision ; Learning ; Modelling ; Networks ; Optical flow (image analysis) ; Original Article ; Probability and Statistics in Computer Science ; Recognition ; Streams |
title | Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T18%3A38%3A24IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Hybrid%20and%20hierarchical%20fusion%20networks:%20a%20deep%20cross-modal%20learning%20architecture%20for%20action%20recognition&rft.jtitle=Neural%20computing%20&%20applications&rft.au=Khowaja,%20Sunder%20Ali&rft.date=2020-07-01&rft.volume=32&rft.issue=14&rft.spage=10423&rft.epage=10434&rft.pages=10423-10434&rft.issn=0941-0643&rft.eissn=1433-3058&rft_id=info:doi/10.1007/s00521-019-04578-y&rft_dat=%3Cproquest_cross%3E2418452440%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2418452440&rft_id=info:pmid/&rfr_iscdi=true |