Action Anticipation Using Pairwise Human-Object Interactions and Transformers

The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions involving the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels for every relevant object in the scene that match the ongoing action. Hence, we propose to represent every pairwise human-object (HO) interaction using only their visual features. Next, we use cross-correlation to capture the second-order statistics across human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that the cross-correlation-based frame representation is better suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of frame-wise HO representations results in better action anticipation than other temporal networks. We therefore propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt) model that combines the evidence across spatio-temporal, motion, and HO representations. We show the performance of MM-Transformer on procedural datasets like 50 Salads and Breakfast, and an unscripted dataset like EPIC-KITCHENS55. Finally, we demonstrate that the combination of human-object representation and MM-Transformers is effective even for long-term anticipation.
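The cross-correlation step described in the abstract can be pictured with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: the feature dimension, signed square-root normalisation, and flattening are assumptions chosen only to show how a variable number of human-object pair features can be turned into a fixed-size, second-order frame representation.

```python
# Minimal sketch (assumptions, not the paper's exact formulation) of a
# second-order frame representation built from a variable number of
# human-object (HO) pair features.
import torch

def frame_representation(pair_features: torch.Tensor) -> torch.Tensor:
    """pair_features: (num_pairs, d) visual features of the HO pairs in one frame.
    Returns a flattened d x d correlation matrix whose size does not depend
    on how many pairs were detected."""
    n, d = pair_features.shape
    # Second-order statistics across pairs: averaged outer product of features.
    corr = pair_features.t() @ pair_features / max(n, 1)        # (d, d)
    # Signed square-root and L2 normalisation are common for second-order
    # pooling; they are illustrative choices here, not taken from the paper.
    corr = torch.sign(corr) * torch.sqrt(corr.abs() + 1e-12)
    corr = corr / (corr.norm() + 1e-12)
    return corr.flatten()                                        # (d * d,)

# Frames with different numbers of detected HO pairs map to the same-sized
# vector, so they can be stacked into a temporal sequence.
frames = [torch.randn(3, 64), torch.randn(7, 64), torch.randn(1, 64)]
sequence = torch.stack([frame_representation(f) for f in frames])
print(sequence.shape)  # torch.Size([3, 4096])
```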

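For the temporal side, the abstract describes aggregating frame-wise representations from several modalities with a transformer. The sketch below shows one plausible arrangement under stated assumptions (linearly projected streams fused by addition, a small `nn.TransformerEncoder` without positional encoding, mean pooling, and a single classification head); the actual MM-Transformer architecture and its two proposed fusion variants are defined in the paper and the linked repository.

```python
# Minimal sketch of transformer-based temporal aggregation over multi-modal
# per-frame features. Dimensions, additive fusion, mean pooling, the omitted
# positional encoding, and the class count are all illustrative assumptions,
# not the paper's MM-Transformer.
import torch
import torch.nn as nn

class MultiModalAnticipatorSketch(nn.Module):
    def __init__(self, dims=(4096, 1024, 1024), d_model=512, num_actions=48, num_layers=2):
        super().__init__()
        # One linear projection per modality (e.g. HO, spatio-temporal, motion).
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_actions)

    def forward(self, streams):
        # streams: one (batch, time, dim) tensor per modality, all covering the
        # same observed frames. Fuse by summing the projected streams.
        fused = sum(p(s) for p, s in zip(self.proj, streams))   # (B, T, d_model)
        encoded = self.encoder(fused)                           # (B, T, d_model)
        return self.head(encoded.mean(dim=1))                   # (B, num_actions)

model = MultiModalAnticipatorSketch()
ho = torch.randn(2, 10, 4096)       # e.g. flattened cross-correlation features
spatial = torch.randn(2, 10, 1024)  # e.g. spatio-temporal appearance features
motion = torch.randn(2, 10, 1024)   # e.g. optical-flow / motion features
logits = model([ho, spatial, motion])  # scores for the anticipated action
print(logits.shape)                    # torch.Size([2, 48])
```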

Saved in:
Bibliographic Details
Published in: IEEE transactions on image processing, 2021, Vol. 30, p. 8116-8129
Main Authors: Roy, Debaditya; Fernando, Basura
Format: Article
Language: English
Subjects: Convolutional codes; Cross correlation; Datasets; Feature extraction; Image motion analysis; Image representation; Image sequence analysis; Object detection; Object recognition; Predictive models; Representations; Salads; Smart buildings; Transformers; Visualization
Online Access: Order full text
container_end_page 8129
container_issue
container_start_page 8116
container_title IEEE transactions on image processing
container_volume 30
creator Roy, Debaditya
Fernando, Basura
description The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions involving the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels for every relevant object in the scene that match the ongoing action. Hence, we propose to represent every pairwise human-object (HO) interaction using only their visual features. Next, we use cross-correlation to capture the second-order statistics across human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that cross-correlation based frame representation is more suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of frame-wise HO representations results in better action anticipation than other temporal networks. So, we propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt ) model that combines the evidence across spatio-temporal, motion, and HO representations. We show the performance of MM-Transformer on procedural datasets like 50 Salads and Breakfast, and an unscripted dataset like EPIC-KITCHENS55. Finally, we demonstrate that the combination of human-object representation and MM-Transformers is effective even for long-term anticipation.
doi_str_mv 10.1109/TIP.2021.3113114
format Article
eissn 1941-0042
coden IIPRE4
pmid 34550884
publisher New York: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
orcidid 0000-0002-6920-9916
0000-0002-8779-1241
ieee_id 9546623
tpages 14
fulltext fulltext_linktorsrc
identifier ISSN: 1057-7149
ispartof IEEE transactions on image processing, 2021, Vol.30, p.8116-8129
issn 1057-7149
1941-0042
language eng
recordid cdi_proquest_journals_2577554819
source IEEE Electronic Library (IEL)
subjects Convolutional codes
Cross correlation
Datasets
Feature extraction
Image motion analysis
image representation
Image sequence analysis
Object detection
object recognition
Predictive models
Representations
Salads
Smart buildings
Transformers
Visualization
title Action Anticipation Using Pairwise Human-Object Interactions and Transformers
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T13%3A42%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Action%20Anticipation%20Using%20Pairwise%20Human-Object%20Interactions%20and%20Transformers&rft.jtitle=IEEE%20transactions%20on%20image%20processing&rft.au=Roy,%20Debaditya&rft.date=2021&rft.volume=30&rft.spage=8116&rft.epage=8129&rft.pages=8116-8129&rft.issn=1057-7149&rft.eissn=1941-0042&rft.coden=IIPRE4&rft_id=info:doi/10.1109/TIP.2021.3113114&rft_dat=%3Cproquest_RIE%3E2575829334%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2577554819&rft_id=info:pmid/34550884&rft_ieee_id=9546623&rfr_iscdi=true