Fine-grained Spatio-temporal Parsing Network for Action Quality Assessment

Action Quality Assessment (AQA) plays an important role in video analysis, where it is applied to evaluate how well specific actions, e.g., sports activities, are performed. It remains challenging because action sequences differ only in small details against similar backgrounds, yet current approaches mostly adopt holistic video representations, so fine-grained intra-class variations cannot be captured. To address this challenge, we propose a Fine-grained Spatio-temporal Parsing Network (FSPN), composed of an intra-sequence action parsing module and a spatiotemporal multiscale transformer module, to learn fine-grained spatiotemporal sub-action representations for more reliable AQA. The intra-sequence action parsing module performs semantic sub-action parsing by mining sub-actions at fine-grained levels, enabling an accurate description of the subtle differences between action sequences. The spatiotemporal multiscale transformer module learns motion-oriented action features and captures their long-range dependencies among sub-actions at different scales. Furthermore, we design a group contrastive loss to train the model and learn more discriminative feature representations for sub-actions without explicit supervision. We exhaustively evaluate the proposed approach on the FineDiving, AQA-7, and MTL-AQA datasets. Extensive experimental results demonstrate the effectiveness and feasibility of the approach, which outperforms state-of-the-art methods by a significant margin.
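The abstract describes the group contrastive loss only at a high level. As a rough illustration of how such a group-level contrastive objective can be written, the PyTorch sketch below treats same-index sub-action group features from two sequences as positive pairs and all cross-sequence mismatches as negatives; the pairing scheme, temperature, and feature shapes are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch of a group-wise contrastive objective over sub-action
# features, in the spirit of the paper's group contrastive loss. The
# InfoNCE-style pairing, temperature, and shapes are assumptions for
# illustration, not the authors' exact definition.
import torch
import torch.nn.functional as F


def group_contrastive_loss(feats_a, feats_b, temperature=0.1):
    """feats_a, feats_b: (G, D) tensors, one D-dim feature per sub-action
    group, extracted from two comparable action sequences. Features with
    the same group index are positives; all other cross-sequence pairs
    serve as negatives."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature          # (G, G) scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: each group should match its counterpart.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Example: 5 hypothetical sub-action groups with 256-dim features.
fa, fb = torch.randn(5, 256), torch.randn(5, 256)
print(group_contrastive_loss(fa, fb).item())
```

A loss of this form needs no action-quality labels for the grouping itself, which is consistent with the abstract's claim that discriminative sub-action features are learned without explicit supervision.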


Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2023-01, Vol. 32, p. 1-1
Main authors: Gedamu, Kumie; Ji, Yanli; Yang, Yang; Shao, Jie; Shen, Heng Tao
Format: Article
Language: English
Publisher: IEEE (New York)
Online Access: Order full text
DOI: 10.1109/TIP.2023.3331212
ISSN: 1057-7149
EISSN: 1941-0042
Source: IEEE Electronic Library (IEL)
Subjects: Action parsing; Action quality assessment; Decoding; Fine-grained representation; Modules; Multiscale transformer; Quality assessment; Representations; Semantics; Spatiotemporal phenomena; Sports; Task analysis; Transformers
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T19%3A06%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Fine-grained%20Spatio-temporal%20Parsing%20Network%20for%20Action%20Quality%20Assessment&rft.jtitle=IEEE%20transactions%20on%20image%20processing&rft.au=Gedamu,%20Kumie&rft.date=2023-01-01&rft.volume=32&rft.spage=1&rft.epage=1&rft.pages=1-1&rft.issn=1057-7149&rft.eissn=1941-0042&rft.coden=IIPRE4&rft_id=info:doi/10.1109/TIP.2023.3331212&rft_dat=%3Cproquest_RIE%3E2892375349%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2892375349&rft_id=info:pmid/&rft_ieee_id=10317826&rfr_iscdi=true