Text-Enhanced Zero-Shot Action Recognition: A training-free approach

Vision-language models (VLMs) have demonstrated remarkable performance across various visual tasks, leveraging joint learning of visual and textual representations. While these models excel in zero-shot image tasks, their application to zero-shot video action recognition (ZS-VAR) remains challenging due to the dynamic and temporal nature of actions. Existing methods for ZS-VAR typically require extensive training on specific datasets, which can be resource-intensive and may introduce domain biases. In this work, we propose Text-Enhanced Action Recognition (TEAR), a simple approach to ZS-VAR that is training-free and requires neither training data nor extensive computational resources. Drawing inspiration from recent findings in the vision-and-language literature, we use action descriptors for decomposition and contextual information to enhance zero-shot action recognition. Through experiments on the UCF101, HMDB51, and Kinetics-600 datasets, we demonstrate the effectiveness and applicability of our approach in addressing the challenges of ZS-VAR.
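
As a rough illustration of the training-free recipe described in the abstract, the sketch below scores a video against text descriptors with an off-the-shelf CLIP model: frame embeddings are averaged into a video embedding, each action class is represented by the mean embedding of a few descriptors, and the highest cosine similarity wins. This is an assumption-laden sketch, not the authors' released code: the model checkpoint, the example descriptors, and the `classify` helper are all hypothetical, and TEAR's actual descriptor generation and aggregation may differ.

```python
# Minimal sketch of training-free zero-shot video action recognition
# (illustrative only; not the TEAR implementation).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical per-class descriptors that decompose each action into
# steps and context (in TEAR these would come from a language model).
class_descriptors = {
    "archery": [
        "a person drawing a bow and aiming an arrow",
        "an arrow flying toward a round target outdoors",
    ],
    "playing piano": [
        "hands pressing keys on a piano keyboard",
        "a person seated at a piano reading sheet music",
    ],
}

@torch.no_grad()
def classify(frames: list[Image.Image]) -> str:
    """Return the action whose descriptor embeddings best match the video."""
    # Average frame embeddings into a single video embedding.
    image_inputs = processor(images=frames, return_tensors="pt")
    video_emb = model.get_image_features(**image_inputs).mean(dim=0)
    video_emb = video_emb / video_emb.norm()

    scores = {}
    for action, descriptors in class_descriptors.items():
        text_inputs = processor(text=descriptors, return_tensors="pt", padding=True)
        text_emb = model.get_text_features(**text_inputs).mean(dim=0)
        text_emb = text_emb / text_emb.norm()
        # Cosine similarity between the video and the class embedding.
        scores[action] = torch.dot(video_emb, text_emb).item()
    return max(scores, key=scores.get)
```

Mean-pooling frames is the simplest possible temporal aggregation and is used here only to keep the sketch short; it discards the ordering information that makes ZS-VAR harder than zero-shot image classification.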

Bibliographic Details
Published in: arXiv.org, 2024-08
Main authors: Bosetti, Massimo; Zhang, Shibingfeng; Liberatori, Benedetta; Zara, Giacomo; Ricci, Elisa; Rota, Paolo
Format: Article
Language: English
Subjects: Activity recognition; Datasets; Image enhancement; Vision; Visual tasks
Online access: Full text
EISSN: 2331-8422
Record ID: cdi_proquest_journals_3098951417
Source: Freely Accessible Journals
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T01%3A18%3A45IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Text-Enhanced%20Zero-Shot%20Action%20Recognition:%20A%20training-free%20approach&rft.jtitle=arXiv.org&rft.au=Bosetti,%20Massimo&rft.date=2024-08-29&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3098951417%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3098951417&rft_id=info:pmid/&rfr_iscdi=true