Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition

Skeleton-based action recognition is a central task in human-computer interaction. However, most previous methods suffer from two issues: (i) semantic ambiguity arising from spatial-temporal information mixture; and (ii) overlooking the explicit exploitation of the latent data distributions (i.e., the intra-class variations and inter-class relations), thereby leading to sub-optimum solutions of the skeleton encoders. To mitigate this, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain discriminative and semantically distinct representations from the sequences, which can be incorporated into various previous skeleton encoders and can be removed at test time. Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce the spatial-temporal coupling of features. Furthermore, to explicitly exploit the latent data distributions, we feed the attentive features into a contrastive learning objective, which models cross-sequence semantic relations by pulling together features from positive pairs and pushing apart those from negative pairs. Extensive experiments show that STD-CL with four different skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvements on the NTU60, NTU120, and NW-UCLA benchmarks. The code will be released soon.
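The "pull together / push apart" objective described in the abstract can be illustrated with a minimal InfoNCE-style contrastive loss. This is a generic sketch for intuition only, not the authors' STD-CL implementation; the function names, the choice of cosine similarity, and the temperature value are assumptions.

```python
import math

def cosine_sim(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is similar to its
    positive pair and dissimilar to every negative."""
    sims = [cosine_sim(anchor, positive)]
    sims += [cosine_sim(anchor, neg) for neg in negatives]
    logits = [s / temperature for s in sims]
    # Numerically stable softmax over [positive, negatives...];
    # the loss is the negative log-probability of the positive.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

# A well-aligned positive pair yields a much smaller loss than a
# misaligned one, mirroring the pull/push behaviour described above.
loss_aligned = contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
loss_misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
assert loss_aligned < loss_misaligned
```

Minimizing such a loss drives anchor and positive features together in the embedding space while pushing negatives away, which is the cross-sequence semantic modelling the abstract refers to.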

Bibliographic details

Authors: Zhang, Shaojie; Yin, Jianqin; Dang, Yonghao
Format: Article
Language: English
DOI: 10.48550/arxiv.2312.15144
Date: 2023-12-22
Rights: http://creativecommons.org/licenses/by/4.0
Full text: https://arxiv.org/abs/2312.15144
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition