Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting

Bibliographic Details
Main Authors: Biswas, Sovan; Gall, Juergen
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: Order full text
creationdate 2021-01-21
creator Biswas, Sovan ; Gall, Juergen
description Since collecting and annotating data for spatio-temporal action detection is very expensive, there is a need for approaches that learn with less supervision. Weakly supervised approaches do not require any bounding box annotations and can be trained only from labels that indicate whether an action occurs in a video clip. Current approaches, however, cannot handle the case when there are multiple persons in a video who perform multiple actions at the same time. In this work, we address this very challenging task for the first time. We propose a baseline based on multi-instance and multi-label learning. Furthermore, we propose a novel approach that uses sets of actions as representation instead of modeling individual action classes. Since computing the probabilities for the full power set becomes intractable as the number of action classes increases, we assign an action set to each detected person under the constraint that the assignment is consistent with the annotation of the video clip. We evaluate the proposed approach on the challenging AVA dataset, where it outperforms the MIML baseline and is competitive with fully supervised approaches. (A toy sketch of this set-assignment step follows the record fields below.)
doi_str_mv 10.48550/arxiv.2101.08567
format Article
identifier DOI: 10.48550/arxiv.2101.08567
language eng
recordid cdi_arxiv_primary_2101_08567
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting
url https://arxiv.org/abs/2101.08567
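
The description field above outlines the paper's central idea: rather than scoring every subset in the full power set, which grows as 2^C with the number of action classes C, an action set is assigned to each detected person under the constraint that the assignments stay consistent with the clip-level annotation. The following is a minimal, hypothetical Python sketch of such an assignment step, not the authors' implementation; the function names, the mean-probability set score, and the greedy repair strategy are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of assigning an action set to each
# detected person so that the union of the assigned sets matches the
# clip-level annotation. All names and strategies here are assumptions.
from itertools import chain, combinations

def candidate_sets(clip_labels):
    """Return all non-empty subsets of the clip-level label set.
    Restricting candidates to subsets of the annotation (instead of the
    2^C power set over all C classes) keeps enumeration tractable."""
    labels = sorted(clip_labels)
    return [frozenset(c)
            for r in range(1, len(labels) + 1)
            for c in combinations(labels, r)]

def assign_action_sets(person_scores, clip_labels):
    """person_scores: one dict per detected person mapping action -> probability.
    Returns one action set per person whose union covers clip_labels,
    i.e. an assignment consistent with the video-level annotation."""
    cands = candidate_sets(clip_labels)

    # Score a candidate set for a person by the mean probability of its actions.
    def set_score(scores, s):
        return sum(scores.get(a, 0.0) for a in s) / len(s)

    # Each person independently picks the best-scoring candidate set.
    assignment = [max(cands, key=lambda s: set_score(sc, s))
                  for sc in person_scores]

    # Greedy repair: any annotated action covered by no one is given to the
    # person who scores it highest, restoring consistency with the clip label.
    missing = set(clip_labels) - set(chain.from_iterable(assignment))
    for action in missing:
        i = max(range(len(person_scores)),
                key=lambda i: person_scores[i].get(action, 0.0))
        assignment[i] = assignment[i] | {action}
    return assignment

# Example: two detected persons, clip annotated with {"stand", "talk", "listen"}.
people = [{"stand": 0.9, "talk": 0.8, "listen": 0.1},
          {"stand": 0.7, "talk": 0.2, "listen": 0.6}]
print(assign_action_sets(people, {"stand", "talk", "listen"}))
# -> e.g. [frozenset({'stand', 'talk'}), frozenset({'stand', 'listen'})]
```

Restricting candidates to subsets of the clip's own label set is what makes this feasible: a clip annotated with k actions yields only 2^k - 1 candidate sets, independent of the total number of classes C.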