Hierarchical Feature Aggregation Networks for Video Action Recognition



Bibliographic details
Main authors: Sudhakaran, Swathikiran; Escalera, Sergio; Lanz, Oswald
Format: Article
Language: English
description Most action recognition methods are based on either a) late aggregation of frame-level CNN features using average pooling, max pooling, or an RNN, among others, or b) spatio-temporal aggregation via 3D convolutions. The first assumes independence among frame features up to a certain level of abstraction and then performs higher-level aggregation, while the second extracts spatio-temporal features from grouped frames as early fusion. In this paper we explore the space in between these two, by letting adjacent feature branches interact as they develop into the higher-level representation. The interaction happens between feature differencing and averaging at each level of the hierarchy, and it has a convolutional structure that learns to select the appropriate mode locally, in contrast to previous works that impose one of the modes globally (e.g. feature differencing) as a design choice. We further constrain this interaction to be conservative, i.e. a local feature subtraction in one branch is compensated by an addition in another, such that the total feature flow is preserved. We evaluate our proposal on a number of existing models, i.e. TSN, TRN and ECO, to show its flexibility and effectiveness in improving action recognition performance.
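The conservative cross-branch interaction described in the abstract can be illustrated with a minimal NumPy sketch. This is a simplified illustration, not the authors' exact formulation: the gate `g` and the function name are assumptions for exposition (in the paper the mode is selected locally by a learned convolutional structure). The key property shown is that whatever one branch subtracts, the other branch gains, so the summed feature flow is preserved.

```python
import numpy as np

def conservative_interaction(x_a, x_b, g):
    """Exchange features between two adjacent branches.

    x_a, x_b : feature maps of two neighbouring branches (same shape)
    g        : local gating map; here a plain array stand-in for the
               output of a small learned convolution. Positive values
               push the pair toward averaging-like fusion, negative
               values toward differencing-like behaviour.

    The exchange is conservative: the amount subtracted from one
    branch is added to the other, so x_a + x_b is unchanged.
    """
    delta = g * (x_b - x_a)        # locally selected interaction mode
    return x_a + delta, x_b - delta

# Toy example with 1-D "feature maps"
x_a = np.array([1.0, 2.0, 3.0])
x_b = np.array([4.0, 0.0, 1.0])
g   = np.array([0.5, -0.3, 0.0])   # stand-in for a learned conv gate

y_a, y_b = conservative_interaction(x_a, x_b, g)
assert np.allclose(y_a + y_b, x_a + x_b)  # total feature flow preserved
```

Note that at `g = 0.5` both branches receive the average of the pair, while `g = 0` leaves them untouched; intermediate and negative values interpolate between averaging- and differencing-like fusion per location.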
DOI: 10.48550/arxiv.1905.12462
Published: 2019-05-29
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition