RGB Stream Is Enough for Temporal Action Detection

State-of-the-art temporal action detectors to date are based on two-stream input including RGB frames and optical flow. Although combining RGB frames and optical flow boosts performance significantly, optical flow is a hand-designed representation which not only requires heavy computation, but also...

Bibliographic Details
Main authors:	Wang, Chenhao; Cai, Hongxiang; Zou, Yuxin; Xiong, Yichao
Format:	Article
Language:	English
Subjects:	Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online access:	Order full text

creator	Wang, Chenhao; Cai, Hongxiang; Zou, Yuxin; Xiong, Yichao
description	State-of-the-art temporal action detectors to date are based on two-stream input comprising RGB frames and optical flow. Although combining RGB frames and optical flow boosts performance significantly, optical flow is a hand-designed representation that not only requires heavy computation but is also methodologically unsatisfactory, because two-stream methods are often not learned end-to-end jointly with the flow. In this paper, we argue that optical flow is dispensable in high-accuracy temporal action detection and that image-level data augmentation (ILDA) is the key to avoiding performance degradation when optical flow is removed. To evaluate the effectiveness of ILDA, we design a simple yet efficient one-stage temporal action detector based on a single RGB stream, named DaoTAD. Our results show that, when trained with ILDA, DaoTAD achieves accuracy comparable to all existing state-of-the-art two-stream detectors while surpassing the inference speed of previous methods by a large margin, reaching an astounding 6668 fps on a GeForce GTX 1080 Ti. Code is available at https://github.com/Media-Smart/vedatad.
doi_str_mv	10.48550/arxiv.2107.04362
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2107_04362</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2107_04362</sourcerecordid><originalsourceid>FETCH-LOGICAL-a672-d217a4e6dec0d6451760a479e43fa88cb9cfaca4fcc20105c7f6fed28bc6e9b53</originalsourceid><addsrcrecordid>eNotzstuwjAQhWFvukC0D8AKv0BS2_ElWVLKTYqEBNlHk8kYIhGCTFrB2yMCq_Ovjj7GJlLEOjVGfEO4Nf-xksLFQidWjZjarX74vg8ELd9c-eLc_R2O3HeBF9ReugAnPsO-6c78l3oa6pN9eDhd6eu9Y1YsF8V8HeXb1WY-yyOwTkW1kg402ZpQ1FYb6awA7TLSiYc0xSpDDwjaIyohhUHnradapRVayiqTjNn0dTugy0toWgj38okvB3zyAEk7Pp4</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>RGB Stream Is Enough for Temporal Action Detection</title><source>arXiv.org</source><creator>Wang, Chenhao ; Cai, Hongxiang ; Zou, Yuxin ; Xiong, Yichao</creator><creatorcontrib>Wang, Chenhao ; Cai, Hongxiang ; Zou, Yuxin ; Xiong, Yichao</creatorcontrib><description>State-of-the-art temporal action detectors to date are based on two-stream input including RGB frames and optical flow. Although combining RGB frames and optical flow boosts performance significantly, optical flow is a hand-designed representation which not only requires heavy computation, but also makes it methodologically unsatisfactory that two-stream methods are often not learned end-to-end jointly with the flow. In this paper, we argue that optical flow is dispensable in high-accuracy temporal action detection and image level data augmentation (ILDA) is the key solution to avoid performance degradation when optical flow is removed. To evaluate the effectiveness of ILDA, we design a simple yet efficient one-stage temporal action detector based on single RGB stream named DaoTAD. 
Our results show that when trained with ILDA, DaoTAD has comparable accuracy with all existing state-of-the-art two-stream detectors while surpassing the inference speed of previous methods by a large margin and the inference speed is astounding 6668 fps on GeForce GTX 1080 Ti. Code is available at \url{https://github.com/Media-Smart/vedatad}.</description><identifier>DOI: 10.48550/arxiv.2107.04362</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2021-07</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2107.04362$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2107.04362$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Wang, Chenhao</creatorcontrib><creatorcontrib>Cai, Hongxiang</creatorcontrib><creatorcontrib>Zou, Yuxin</creatorcontrib><creatorcontrib>Xiong, Yichao</creatorcontrib><title>RGB Stream Is Enough for Temporal Action Detection</title><description>State-of-the-art temporal action detectors to date are based on two-stream input including RGB frames and optical flow. Although combining RGB frames and optical flow boosts performance significantly, optical flow is a hand-designed representation which not only requires heavy computation, but also makes it methodologically unsatisfactory that two-stream methods are often not learned end-to-end jointly with the flow. 
In this paper, we argue that optical flow is dispensable in high-accuracy temporal action detection and image level data augmentation (ILDA) is the key solution to avoid performance degradation when optical flow is removed. To evaluate the effectiveness of ILDA, we design a simple yet efficient one-stage temporal action detector based on single RGB stream named DaoTAD. Our results show that when trained with ILDA, DaoTAD has comparable accuracy with all existing state-of-the-art two-stream detectors while surpassing the inference speed of previous methods by a large margin and the inference speed is astounding 6668 fps on GeForce GTX 1080 Ti. Code is available at \url{https://github.com/Media-Smart/vedatad}.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotzstuwjAQhWFvukC0D8AKv0BS2_ElWVLKTYqEBNlHk8kYIhGCTFrB2yMCq_Ovjj7GJlLEOjVGfEO4Nf-xksLFQidWjZjarX74vg8ELd9c-eLc_R2O3HeBF9ReugAnPsO-6c78l3oa6pN9eDhd6eu9Y1YsF8V8HeXb1WY-yyOwTkW1kg402ZpQ1FYb6awA7TLSiYc0xSpDDwjaIyohhUHnradapRVayiqTjNn0dTugy0toWgj38okvB3zyAEk7Pp4</recordid><startdate>20210709</startdate><enddate>20210709</enddate><creator>Wang, Chenhao</creator><creator>Cai, Hongxiang</creator><creator>Zou, Yuxin</creator><creator>Xiong, Yichao</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20210709</creationdate><title>RGB Stream Is Enough for Temporal Action Detection</title><author>Wang, Chenhao ; Cai, Hongxiang ; Zou, Yuxin ; Xiong, Yichao</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a672-d217a4e6dec0d6451760a479e43fa88cb9cfaca4fcc20105c7f6fed28bc6e9b53</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Computer Science - 
Artificial Intelligence</topic><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Wang, Chenhao</creatorcontrib><creatorcontrib>Cai, Hongxiang</creatorcontrib><creatorcontrib>Zou, Yuxin</creatorcontrib><creatorcontrib>Xiong, Yichao</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Wang, Chenhao</au><au>Cai, Hongxiang</au><au>Zou, Yuxin</au><au>Xiong, Yichao</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>RGB Stream Is Enough for Temporal Action Detection</atitle><date>2021-07-09</date><risdate>2021</risdate><abstract>State-of-the-art temporal action detectors to date are based on two-stream input including RGB frames and optical flow. Although combining RGB frames and optical flow boosts performance significantly, optical flow is a hand-designed representation which not only requires heavy computation, but also makes it methodologically unsatisfactory that two-stream methods are often not learned end-to-end jointly with the flow. In this paper, we argue that optical flow is dispensable in high-accuracy temporal action detection and image level data augmentation (ILDA) is the key solution to avoid performance degradation when optical flow is removed. To evaluate the effectiveness of ILDA, we design a simple yet efficient one-stage temporal action detector based on single RGB stream named DaoTAD. Our results show that when trained with ILDA, DaoTAD has comparable accuracy with all existing state-of-the-art two-stream detectors while surpassing the inference speed of previous methods by a large margin and the inference speed is astounding 6668 fps on GeForce GTX 1080 Ti. 
Code is available at \url{https://github.com/Media-Smart/vedatad}.</abstract><doi>10.48550/arxiv.2107.04362</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2107.04362
language	eng
recordid	cdi_arxiv_primary_2107_04362
source	arXiv.org
subjects	Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
title	RGB Stream Is Enough for Temporal Action Detection
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-09T13%3A00%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=RGB%20Stream%20Is%20Enough%20for%20Temporal%20Action%20Detection&rft.au=Wang,%20Chenhao&rft.date=2021-07-09&rft_id=info:doi/10.48550/arxiv.2107.04362&rft_dat=%3Carxiv_GOX%3E2107_04362%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true