Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition

Skeleton-based action recognition is a central task in human-computer interaction. However, most previous methods suffer from two issues: (i) semantic ambiguity arising from spatial-temporal information mixture; and (ii) overlooking the explicit exploitation of the latent data distributions (i.e., the intra-class variations and inter-class relations), thereby leading to sub-optimum solutions of the skeleton encoders. To mitigate this, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain discriminative and semantically distinct representations from the sequences, which can be incorporated into various previous skeleton encoders and can be removed at test time. Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce the spatial-temporal coupling of features. Furthermore, to explicitly exploit the latent data distributions, we feed the attentive features into a contrastive learning objective, which models cross-sequence semantic relations by pulling together features from positive pairs and pushing apart those from negative pairs. Extensive experiments show that STD-CL with four different skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvements on the NTU60, NTU120, and NW-UCLA benchmarks. The code will be released soon.
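The "pull together / push apart" objective described in the abstract can be illustrated with a minimal InfoNCE-style contrastive loss. This is a generic sketch for intuition only, not the authors' STD-CL implementation; the function names, the choice of cosine similarity, and the temperature value are assumptions.

```python
import math

def cosine_sim(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is similar to its
    positive pair and dissimilar to every negative."""
    sims = [cosine_sim(anchor, positive)]
    sims += [cosine_sim(anchor, neg) for neg in negatives]
    logits = [s / temperature for s in sims]
    # Numerically stable softmax over [positive, negatives...];
    # the loss is the negative log-probability of the positive.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

# A well-aligned positive pair yields a much smaller loss than a
# misaligned one, mirroring the pull/push behaviour described above.
loss_aligned = contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
loss_misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
assert loss_aligned < loss_misaligned
```

Minimizing such a loss drives anchor and positive features together in the embedding space while pushing negatives away, which is the cross-sequence semantic modelling the abstract refers to.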

Bibliographic details

Authors: Zhang, Shaojie; Yin, Jianqin; Dang, Yonghao
Format: Article
Language: English
DOI: 10.48550/arxiv.2312.15144
Date: 2023-12-22
Rights: http://creativecommons.org/licenses/by/4.0
Full text: https://arxiv.org/abs/2312.15144
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition