3-D Deconvolutional Networks for the Unsupervised Representation Learning of Human Motions

Bibliographic Details
Published in: IEEE transactions on cybernetics 2022-01, Vol.52 (1), p.398-410
Main Authors: Zhang, Chun-Yang, Xiao, Yong-Yi, Lin, Jin-Cheng, Chen, C. L. Philip, Liu, Wenxi, Tong, Yu-Hong
Format: Article
Language: English
Subjects:
Online Access: Order full text
container_end_page 410
container_issue 1
container_start_page 398
container_title IEEE transactions on cybernetics
container_volume 52
creator Zhang, Chun-Yang
Xiao, Yong-Yi
Lin, Jin-Cheng
Chen, C. L. Philip
Liu, Wenxi
Tong, Yu-Hong
description Data representation learning is one of the most important problems in machine learning. Unsupervised representation learning is particularly attractive because it requires no label information for the observed data. Because training deep-learning models is highly time-consuming, many machine-learning systems directly adapt well-trained deep models, obtained in a supervised and end-to-end manner, as feature extractors for distinct problems. However, different machine-learning tasks clearly require disparate representations of the original input data. Taking human action recognition as an example, human actions in a video sequence are 3-D signals containing both the visual appearance and the motion dynamics of humans and objects. Data representation approaches that can capture both the spatial and the temporal correlations in videos are therefore meaningful. Most existing human motion recognition models build classifiers on deep-learning structures such as deep convolutional networks. These models require a large quantity of annotated training videos, and such supervised models cannot recognize samples from a distinct dataset without retraining. In this article, we propose a new 3-D deconvolutional network (3DDN) for representation learning of high-dimensional video data, in which the high-level features are obtained through an optimization approach. The proposed 3DDN decomposes video frames into spatiotemporal features under a sparse constraint in an unsupervised way. It can also be regarded as a building block for developing deep architectures by stacking. Because the high-level representation of the input sequential data can be used in multiple downstream machine-learning tasks, we evaluate the proposed 3DDN and its deep models on human action recognition. Experimental results on three datasets: 1) KTH; 2) HMDB-51; and 3) UCF-101, demonstrate that the proposed 3DDN is an alternative to feedforward convolutional neural networks (CNNs) that attains comparable results. (An illustrative sketch of the sparse deconvolutional decomposition described here follows the record below.)
doi_str_mv 10.1109/TCYB.2020.2973300
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2168-2267
ispartof IEEE transactions on cybernetics, 2022-01, Vol.52 (1), p.398-410
issn 2168-2267
2168-2275
language eng
recordid cdi_proquest_journals_2619018410
source IEEE Electronic Library (IEL)
subjects 3-D deconvolutional networks (3DDNs)
Annotations
Artificial neural networks
Cognitive tasks
Convolution
Correlation
Data models
data representation
Datasets
Deep learning
Feature extraction
Human Activities
Human activity recognition
Human motion
human motion analysis
Humans
Machine Learning
Motion perception
Neural Networks, Computer
Optimization
Representations
Task analysis
unsupervised learning
Video data
video representation learning
Visual signals
title 3-D Deconvolutional Networks for the Unsupervised Representation Learning of Human Motions
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T05%3A04%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=3-D%20Deconvolutional%20Networks%20for%20the%20Unsupervised%20Representation%20Learning%20of%20Human%20Motions&rft.jtitle=IEEE%20transactions%20on%20cybernetics&rft.au=Zhang,%20Chun-Yang&rft.date=2022-01&rft.volume=52&rft.issue=1&rft.spage=398&rft.epage=410&rft.pages=398-410&rft.issn=2168-2267&rft.eissn=2168-2275&rft.coden=ITCEB8&rft_id=info:doi/10.1109/TCYB.2020.2973300&rft_dat=%3Cproquest_RIE%3E2375505766%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2619018410&rft_id=info:pmid/32149670&rft_ieee_id=9028182&rfr_iscdi=true
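The abstract above describes the 3DDN as decomposing video clips into sparse spatiotemporal feature maps, with the high-level features obtained by optimization rather than a feedforward pass. The Python sketch below illustrates that general idea with a single 3-D deconvolutional (transposed-convolution) layer trained by alternating sparse inference (an ISTA loop) and filter updates. It is a minimal illustration assuming a PyTorch-style formulation; the filter shapes, the ISTA inference scheme, and all hyperparameters (lam, steps, lr) are assumptions for illustration, since the record does not contain the paper's actual algorithm.

```python
# Toy sketch of 3-D deconvolutional sparse coding: reconstruct video clips
# from sparse spatiotemporal feature maps via transposed 3-D convolution.
# All shapes and hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F

def soft_threshold(z, theta):
    # Proximal operator of the L1 norm: shrinks codes toward zero.
    return torch.sign(z) * torch.clamp(z.abs() - theta, min=0.0)

def infer_codes(video, filters, lam=0.1, steps=50, lr=0.1):
    # Infer sparse feature maps z so that conv_transpose3d(z, filters) ~ video.
    # video:   (N, C, T, H, W) clip batch
    # filters: (K, C, t, h, w) spatiotemporal filter bank
    N, C, T, H, W = video.shape
    K, _, t, h, w = filters.shape
    z = video.new_zeros(N, K, T - t + 1, H - h + 1, W - w + 1)
    with torch.no_grad():
        for _ in range(steps):
            residual = F.conv_transpose3d(z, filters) - video  # (N, C, T, H, W)
            grad = F.conv3d(residual, filters)                 # adjoint op: grad w.r.t. z
            z = soft_threshold(z - lr * grad, lr * lam)        # ISTA step
    return z

def update_filters(video, z, filters, lr=0.01):
    # One gradient step on the reconstruction error w.r.t. the filter bank.
    filters = filters.detach().requires_grad_(True)
    loss = 0.5 * ((F.conv_transpose3d(z, filters) - video) ** 2).sum()
    loss.backward()
    with torch.no_grad():
        filters -= lr * filters.grad
        # Renormalize each filter to unit norm so the sparsity penalty stays meaningful.
        norms = filters.flatten(1).norm(dim=1).clamp(min=1e-8)
        filters /= norms.view(-1, 1, 1, 1, 1)
    return filters.detach()

# Alternate sparse inference and filter learning on a toy batch.
video = torch.randn(2, 3, 16, 32, 32)        # 2 clips, 3 channels, 16 frames of 32x32
filters = 0.1 * torch.randn(8, 3, 5, 5, 5)   # K = 8 spatiotemporal filters
for _ in range(5):
    z = infer_codes(video, filters)
    filters = update_filters(video, z, filters)
```

Stacking such layers, with each layer's inferred codes z serving as the next layer's input, corresponds to the abstract's remark that the 3DDN can act as a building block for deep architectures; the inferred codes would then feed a downstream classifier for action recognition.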