Beyond Pattern Variance: Unsupervised 3-D Action Representation Learning With Point Cloud Sequence

Abstract:
This work makes the first research effort to address unsupervised 3-D action representation learning with point cloud sequences, in contrast to existing unsupervised methods that rely on 3-D skeleton information. Our proposition is built on the state-of-the-art 3-D action descriptor, the 3-D dynamic voxel (3DV), combined with contrastive learning (CL). 3DV compresses a point cloud sequence into a compact point cloud that encodes 3-D motion information. Spatiotemporal data augmentations are conducted on it to drive CL. However, we find that existing CL methods (e.g., SimCLR or MoCo v2) often suffer from high pattern variance across the augmented 3DV samples from the same action instance: the augmented 3DV samples remain highly complementary in feature space after CL, yet the complementary discriminative clues within them have not been well exploited. To address this, a feature augmentation adapted CL (FACL) approach is proposed, which facilitates 3-D action representation by considering the features from all augmented 3DV samples jointly, in the spirit of feature augmentation. FACL runs in a global-local way: one branch learns a global feature that involves the discriminative clues from the raw and augmented 3DV samples, and the other focuses on enhancing the discriminative power of the local feature learned from each augmented 3DV sample. The global and local features are fused via concatenation to characterize 3-D action jointly. To fit FACL, a series of spatiotemporal data augmentation approaches is also studied on 3DV. Wide-ranging experiments verify the superiority of our unsupervised method for 3-D action feature learning: it outperforms the state-of-the-art skeleton-based counterparts by 6.4% and 3.6% under the cross-setup and cross-subject test settings on NTU RGB+D 120, respectively. The source code is available at https://github.com/tangent-T/FACL .
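
The abstract's central preprocessing step is compressing a whole point cloud sequence into one compact, motion-encoding point set. A short NumPy sketch of that general idea follows; the grid resolution, normalization bounds, and linear temporal weighting are illustrative assumptions, and the paper's actual 3DV descriptor uses its own rank-pooling-style temporal aggregation, so treat this as a sketch of the concept rather than the authors' implementation.

```python
import numpy as np

def compress_to_motion_points(frames, grid=(32, 32, 32), lo=-1.0, hi=1.0):
    """Compress a point cloud sequence into one motion-encoding point set.

    frames: list of (N_t, 3) float arrays, one per time step, assumed
    normalized into [lo, hi]^3. Returns an (M, 4) array holding occupied
    voxel centers plus a scalar value that encodes temporal order.
    """
    T = len(frames)
    size = np.asarray(grid)
    occ = np.zeros((T, *grid), dtype=np.float32)   # per-frame occupancy grids
    for t, pts in enumerate(frames):
        idx = ((pts - lo) / (hi - lo) * size).astype(int)
        idx = np.clip(idx, 0, size - 1)
        occ[t, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    # Order-aware weights: voxels occupied late in the sequence get positive
    # values, early ones negative (a simple stand-in for rank pooling).
    w = np.linspace(-1.0, 1.0, T, dtype=np.float32).reshape(T, 1, 1, 1)
    motion = (w * occ).sum(axis=0)
    vox = np.argwhere(motion != 0)                  # keep occupied voxels only
    vals = motion[vox[:, 0], vox[:, 1], vox[:, 2]]
    centers = (vox + 0.5) / size * (hi - lo) + lo   # voxel centers, world coords
    return np.concatenate([centers, vals[:, None]], axis=1)
```

The payoff is that a variable-length sequence collapses into a single (M, 4) point set that an ordinary point cloud backbone can consume, which is what makes instance-level contrastive training on whole sequences tractable.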

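The global-local fusion that the abstract attributes to FACL can likewise be sketched. In the hypothetical PyTorch module below, the encoder, the two projection heads, and mean pooling over views are all assumptions made for illustration; only the overall shape (a joint global feature, a per-view local feature, and concatenation of the two) follows the abstract.

```python
import torch
import torch.nn as nn

class GlobalLocalDescriptor(nn.Module):
    """Hypothetical sketch of FACL-style global-local feature fusion.

    encoder: maps a batch of 3DV point sets (B*V, N, 4) to (B*V, dim).
    The global branch pools the features of all V augmented views of an
    instance jointly; the local branch refines each view's own feature.
    The action descriptor concatenates the two branches.
    """

    def __init__(self, encoder: nn.Module, dim: int):
        super().__init__()
        self.encoder = encoder
        self.global_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.local_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, V, N, 4), i.e. V augmented 3DV samples per instance.
        B, V = views.shape[:2]
        feats = self.encoder(views.flatten(0, 1)).view(B, V, -1)  # (B, V, dim)
        g = self.global_head(feats.mean(dim=1))     # joint over all views
        l = self.local_head(feats).mean(dim=1)      # per view, then pooled
        return torch.cat([g, l], dim=-1)            # (B, 2*dim) descriptor
```

During training each branch would be driven by a contrastive objective such as the InfoNCE loss used by SimCLR and MoCo v2, the baselines the abstract names; at test time the concatenated descriptor characterizes the action.
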
Bibliographic details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2024-12, Vol. 35 (12), pp. 18186-18199
Authors: Tan, Bo; Xiao, Yang; Wang, Yancheng; Li, Shuai; Yang, Jianyu; Cao, Zhiguo; Zhou, Joey Tianyi; Yuan, Junsong
Format: Article
Language: English
Subjects: Contrastive learning (CL); Data augmentation; Feature augmentation; Point cloud compression; Point cloud sequence; Representation learning; Skeleton; Spatiotemporal phenomena; Three-dimensional displays; Training; Unsupervised 3-D action representation learning
Online access: Order full text
DOI: 10.1109/TNNLS.2023.3312673
PMID: 37729565
Publisher: IEEE
ISSN: 2162-237X
EISSN: 2162-2388
Source: IEEE Electronic Library (IEL)