Action Anticipation Using Pairwise Human-Object Interactions and Transformers
The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions involving the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels for every relevant object in the scene that match the ongoing action. Hence, we propose to represent every pairwise human-object (HO) interaction using only their visual features. Next, we use cross-correlation to capture the second-order statistics across human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that cross-correlation based frame representation is more suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of frame-wise HO representations results in better action anticipation than other temporal networks. So, we propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt) model that combines the evidence across spatio-temporal, motion, and HO representations. We show the performance of MM-Transformer on procedural datasets like 50 Salads and Breakfast, and an unscripted dataset like EPIC-KITCHENS55. Finally, we demonstrate that the combination of human-object representation and MM-Transformers is effective even for long-term anticipation.
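The abstract names two mechanisms: cross-correlation pooling that turns a variable number of human-object pair features into a fixed-size frame descriptor, and a transformer that aggregates those descriptors over the observed frames. As a rough illustration only, and not the authors' released implementation (their code is at the GitHub link above), the PyTorch-style sketch below shows how such a pipeline could be wired together; all class names, dimensions, and the outer-product pooling details are assumptions, and the multi-modal fusion across spatio-temporal and motion streams is omitted.

```python
import torch
import torch.nn as nn

class CrossCorrelationFrame(nn.Module):
    """Second-order pooling: cross-correlation over a variable number of HO pairs."""
    def __init__(self, feat_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim * feat_dim, out_dim)

    def forward(self, ho_pairs: torch.Tensor) -> torch.Tensor:
        # ho_pairs: (num_pairs, feat_dim); num_pairs may differ from frame to frame.
        # Averaging the pairwise outer products yields a fixed (feat_dim x feat_dim) matrix.
        corr = torch.einsum('pi,pj->ij', ho_pairs, ho_pairs) / ho_pairs.shape[0]
        return self.proj(corr.flatten())  # fixed-size frame descriptor: (out_dim,)

class TemporalTransformer(nn.Module):
    """Transformer encoder that aggregates frame-wise descriptors over time."""
    def __init__(self, dim: int, num_classes: int, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) over the observation period.
        h = self.encoder(frames)
        return self.head(h.mean(dim=1))  # logits for the anticipated action

# Toy usage with made-up sizes: 5 HO pairs of 128-d features per frame, 8 observed frames.
frame_enc = CrossCorrelationFrame(feat_dim=128, out_dim=256)
frames = torch.stack([frame_enc(torch.randn(5, 128)) for _ in range(8)]).unsqueeze(0)
logits = TemporalTransformer(dim=256, num_classes=48)(frames)
print(logits.shape)  # torch.Size([1, 48])
```

Averaging the outer products makes the descriptor size independent of how many human-object pairs a frame contains, which is the property the abstract attributes to the cross-correlation representation.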
Saved in:
Published in: | IEEE transactions on image processing, 2021, Vol. 30, p. 8116-8129 |
---|---|
Main authors: | Roy, Debaditya; Fernando, Basura |
Format: | Article |
Language: | eng |
Subjects: | Convolutional codes; Cross correlation; Datasets; Feature extraction; Image motion analysis; image representation; Image sequence analysis; Object detection; object recognition; Predictive models; Representations; Salads; Smart buildings; Transformers; Visualization |
Online access: | Order full text |
container_end_page | 8129 |
---|---|
container_issue | |
container_start_page | 8116 |
container_title | IEEE transactions on image processing |
container_volume | 30 |
creator | Roy, Debaditya; Fernando, Basura |
description | The ability to anticipate future actions of humans is useful in application areas such as automated driving, robot-assisted manufacturing, and smart homes. These applications require representing and anticipating human actions involving the use of objects. Existing methods that use human-object interactions for anticipation require object affordance labels for every relevant object in the scene that match the ongoing action. Hence, we propose to represent every pairwise human-object (HO) interaction using only their visual features. Next, we use cross-correlation to capture the second-order statistics across human-object pairs in a frame. Cross-correlation produces a holistic representation of the frame that can also handle a variable number of human-object pairs in every frame of the observation period. We show that cross-correlation based frame representation is more suited for action anticipation than attention-based and other second-order approaches. Furthermore, we observe that using a transformer model for temporal aggregation of frame-wise HO representations results in better action anticipation than other temporal networks. So, we propose two approaches for constructing an end-to-end trainable multi-modal transformer (MM-Transformer; code at https://github.com/debadityaroy/MM-Transformer_ActAnt ) model that combines the evidence across spatio-temporal, motion, and HO representations. We show the performance of MM-Transformer on procedural datasets like 50 Salads and Breakfast, and an unscripted dataset like EPIC-KITCHENS55. Finally, we demonstrate that the combination of human-object representation and MM-Transformers is effective even for long-term anticipation. |
doi_str_mv | 10.1109/TIP.2021.3113114 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1057-7149; EISSN: 1941-0042; PMID: 34550884; DOI: 10.1109/TIP.2021.3113114; CODEN: IIPRE4 |
ispartof | IEEE transactions on image processing, 2021, Vol.30, p.8116-8129 |
issn | 1057-7149 1941-0042 |
language | eng |
recordid | cdi_proquest_journals_2577554819 |
source | IEEE Electronic Library (IEL) |
subjects | Convolutional codes; Cross correlation; Datasets; Feature extraction; Image motion analysis; image representation; Image sequence analysis; Object detection; object recognition; Predictive models; Representations; Salads; Smart buildings; Transformers; Visualization |
title | Action Anticipation Using Pairwise Human-Object Interactions and Transformers |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T13%3A42%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Action%20Anticipation%20Using%20Pairwise%20Human-Object%20Interactions%20and%20Transformers&rft.jtitle=IEEE%20transactions%20on%20image%20processing&rft.au=Roy,%20Debaditya&rft.date=2021&rft.volume=30&rft.spage=8116&rft.epage=8129&rft.pages=8116-8129&rft.issn=1057-7149&rft.eissn=1941-0042&rft.coden=IIPRE4&rft_id=info:doi/10.1109/TIP.2021.3113114&rft_dat=%3Cproquest_RIE%3E2575829334%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2577554819&rft_id=info:pmid/34550884&rft_ieee_id=9546623&rfr_iscdi=true |