Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

The work in this paper is driven by the question of how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular. Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel dep...

Detailed description

Bibliographic details
Main authors:	Diba, Ali; Fayyaz, Mohsen; Sharma, Vivek; Karami, Amir Hossein; Arzani, Mohammad Mahdi; Yousefzadeh, Rahman; Van Gool, Luc
Format:	Article
Language:	eng
Subjects:	Computer Science - Computer Vision and Pattern Recognition
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Diba, Ali ; Fayyaz, Mohsen ; Sharma, Vivek ; Karami, Amir Hossein ; Arzani, Mohammad Mahdi ; Yousefzadeh, Rahman ; Van Gool, Luc
description	The work in this paper is driven by the question of how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular. Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel depths. We introduce a new temporal layer that models variable temporal convolution kernel depths. We embed this new temporal layer in our proposed 3D CNN. We extend the DenseNet architecture, which is normally 2D, with 3D filters and pooling kernels. We name our proposed video convolutional network 'Temporal 3D ConvNet' (T3D) and its new temporal layer 'Temporal Transition Layer' (TTL). Our experiments show that T3D outperforms the current state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. Another issue in training 3D ConvNets is that they must be trained from scratch on a huge labeled dataset to reach reasonable performance, so the knowledge learned in 2D ConvNets is completely ignored. A further contribution of this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by finetuning this network, we beat the performance of generic and recent 3D CNN methods that were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101. The T3D code will be released.
doi_str_mv	10.48550/arxiv.1711.08200
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_1711_08200</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>1711_08200</sourcerecordid><originalsourceid>FETCH-LOGICAL-a1150-ffef411dc746c719fbc82923254b5d4d51bf6b06d03b5e3a1293b66cb519030c3</originalsourceid><addsrcrecordid>eNotz7tOwzAUgGEvDKjwAEz4BRJ84jgXtipcpagsadfo2D4ullKnskOBt0cUpn_7pY-xGxB52Sgl7jB--VMONUAumkKIS7Yd6HCcI05cPvBuDqcNLemeb-iTr6N59wuZ5SMSx2D5EDEkR5H3hDH4sOdujnznLc28mzAl77zBxc_hil04nBJd_3fFtk-PQ_eS9W_Pr926zxBAicw5ciWANXVZmRpap01TtIUsVKmVLa0C7SotKiukViQRilbqqjJaQSukMHLFbv--Z9h4jP6A8Xv8BY5noPwBxkxK4Q</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification</title><source>arXiv.org</source><creator>Diba, Ali ; Fayyaz, Mohsen ; Sharma, Vivek ; Karami, Amir Hossein ; Arzani, Mohammad Mahdi ; Yousefzadeh, Rahman ; Van Gool, Luc</creator><creatorcontrib>Diba, Ali ; Fayyaz, Mohsen ; Sharma, Vivek ; Karami, Amir Hossein ; Arzani, Mohammad Mahdi ; Yousefzadeh, Rahman ; Van Gool, Luc</creatorcontrib><description>The work in this paper is driven by the question how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular? Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel depths. We introduce a new temporal layer that models variable temporal convolution kernel depths. We embed this new temporal layer in our proposed 3D CNN. We extend the DenseNet architecture - which normally is 2D - with 3D filters and pooling kernels. We name our proposed video convolutional network `Temporal 3D ConvNet'~(T3D) and its new temporal layer `Temporal Transition Layer'~(TTL). 
Our experiments show that T3D outperforms the current state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D ConvNets is about training them from scratch with a huge labeled dataset to get a reasonable performance. So the knowledge learned in 2D ConvNets is completely ignored. Another contribution in this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by finetuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101. The T3D codes will be released</description><identifier>DOI: 10.48550/arxiv.1711.08200</identifier><language>eng</language><subject>Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2017-11</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a1150-ffef411dc746c719fbc82923254b5d4d51bf6b06d03b5e3a1293b66cb519030c3</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/1711.08200$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.1711.08200$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Diba, Ali</creatorcontrib><creatorcontrib>Fayyaz, Mohsen</creatorcontrib><creatorcontrib>Sharma, Vivek</creatorcontrib><creatorcontrib>Karami, Amir Hossein</creatorcontrib><creatorcontrib>Arzani, Mohammad 
Mahdi</creatorcontrib><creatorcontrib>Yousefzadeh, Rahman</creatorcontrib><creatorcontrib>Van Gool, Luc</creatorcontrib><title>Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification</title><description>The work in this paper is driven by the question how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular? Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel depths. We introduce a new temporal layer that models variable temporal convolution kernel depths. We embed this new temporal layer in our proposed 3D CNN. We extend the DenseNet architecture - which normally is 2D - with 3D filters and pooling kernels. We name our proposed video convolutional network `Temporal 3D ConvNet'~(T3D) and its new temporal layer `Temporal Transition Layer'~(TTL). Our experiments show that T3D outperforms the current state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D ConvNets is about training them from scratch with a huge labeled dataset to get a reasonable performance. So the knowledge learned in 2D ConvNets is completely ignored. Another contribution in this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by finetuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101. 
The T3D codes will be released</description><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz7tOwzAUgGEvDKjwAEz4BRJ84jgXtipcpagsadfo2D4ullKnskOBt0cUpn_7pY-xGxB52Sgl7jB--VMONUAumkKIS7Yd6HCcI05cPvBuDqcNLemeb-iTr6N59wuZ5SMSx2D5EDEkR5H3hDH4sOdujnznLc28mzAl77zBxc_hil04nBJd_3fFtk-PQ_eS9W_Pr926zxBAicw5ciWANXVZmRpap01TtIUsVKmVLa0C7SotKiukViQRilbqqjJaQSukMHLFbv--Z9h4jP6A8Xv8BY5noPwBxkxK4Q</recordid><startdate>20171122</startdate><enddate>20171122</enddate><creator>Diba, Ali</creator><creator>Fayyaz, Mohsen</creator><creator>Sharma, Vivek</creator><creator>Karami, Amir Hossein</creator><creator>Arzani, Mohammad Mahdi</creator><creator>Yousefzadeh, Rahman</creator><creator>Van Gool, Luc</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20171122</creationdate><title>Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification</title><author>Diba, Ali ; Fayyaz, Mohsen ; Sharma, Vivek ; Karami, Amir Hossein ; Arzani, Mohammad Mahdi ; Yousefzadeh, Rahman ; Van Gool, Luc</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a1150-ffef411dc746c719fbc82923254b5d4d51bf6b06d03b5e3a1293b66cb519030c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Diba, Ali</creatorcontrib><creatorcontrib>Fayyaz, Mohsen</creatorcontrib><creatorcontrib>Sharma, Vivek</creatorcontrib><creatorcontrib>Karami, Amir Hossein</creatorcontrib><creatorcontrib>Arzani, Mohammad Mahdi</creatorcontrib><creatorcontrib>Yousefzadeh, Rahman</creatorcontrib><creatorcontrib>Van Gool, Luc</creatorcontrib><collection>arXiv Computer 
Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Diba, Ali</au><au>Fayyaz, Mohsen</au><au>Sharma, Vivek</au><au>Karami, Amir Hossein</au><au>Arzani, Mohammad Mahdi</au><au>Yousefzadeh, Rahman</au><au>Van Gool, Luc</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification</atitle><date>2017-11-22</date><risdate>2017</risdate><abstract>The work in this paper is driven by the question how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular? Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel depths. We introduce a new temporal layer that models variable temporal convolution kernel depths. We embed this new temporal layer in our proposed 3D CNN. We extend the DenseNet architecture - which normally is 2D - with 3D filters and pooling kernels. We name our proposed video convolutional network `Temporal 3D ConvNet'~(T3D) and its new temporal layer `Temporal Transition Layer'~(TTL). Our experiments show that T3D outperforms the current state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D ConvNets is about training them from scratch with a huge labeled dataset to get a reasonable performance. So the knowledge learned in 2D ConvNets is completely ignored. Another contribution in this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. 
Thus, by finetuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101. The T3D codes will be released</abstract><doi>10.48550/arxiv.1711.08200</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.1711.08200
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_1711_08200
source	arXiv.org
subjects	Computer Science - Computer Vision and Pattern Recognition
title	Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T06%3A53%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Temporal%203D%20ConvNets:%20New%20Architecture%20and%20Transfer%20Learning%20for%20Video%20Classification&rft.au=Diba,%20Ali&rft.date=2017-11-22&rft_id=info:doi/10.48550/arxiv.1711.08200&rft_dat=%3Carxiv_GOX%3E1711_08200%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true