Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. A neural network is then built and trained to yield these statistical summaries given the video frames as inputs. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field and needs only impressions of rough spatial locations to understand visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D, and S3D-G. The results show that our approach outperforms existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at https://github.com/laura-wang/video_repres_sts.
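The abstract does not spell out how the pretext labels are computed; the fragment below is a minimal sketch of the idea, not the authors' implementation (which is available at the GitHub link above). It derives two coarse labels from one clip: the grid block with the largest accumulated optical-flow magnitude, and the dominant quantized flow direction within that block. The 4x4 grid, 8 direction bins, and use of Farneback optical flow are all illustrative assumptions.

```python
import cv2
import numpy as np

def motion_statistics(frames, grid=4, n_dirs=8):
    """frames: list of H x W uint8 grayscale frames from one clip.

    Returns (block_index, direction_bin): the grid cell with the largest
    accumulated motion and its dominant quantized flow direction.
    """
    h, w = frames[0].shape
    mag_sum = np.zeros((grid, grid))            # total flow magnitude per block
    dir_hist = np.zeros((grid, grid, n_dirs))   # magnitude-weighted direction histogram
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Dense optical flow between consecutive frames (Farneback).
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # ang in radians, [0, 2*pi)
        for i in range(grid):
            for j in range(grid):
                ys = slice(i * h // grid, (i + 1) * h // grid)
                xs = slice(j * w // grid, (j + 1) * w // grid)
                m, a = mag[ys, xs].ravel(), ang[ys, xs].ravel()
                mag_sum[i, j] += m.sum()
                bins = (a / (2 * np.pi) * n_dirs).astype(int) % n_dirs
                np.add.at(dir_hist[i, j], bins, m)  # weight each direction bin by magnitude
    i, j = np.unravel_index(mag_sum.argmax(), mag_sum.shape)
    return i * grid + j, int(dir_hist[i, j].argmax())
```

Predicting a block index rather than (x, y) coordinates mirrors the paper's use of coarse spatial partitioning patterns: classification over a small set of regions is easier for the network to learn than exact coordinate regression.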

Bibliographic details

Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022-07, Vol. 44 (7), p. 3791-3806
Main authors: Wang, Jiangliu; Jiao, Jianbo; Bao, Linchao; He, Shengfeng; Liu, Wei; Liu, Yun-hui
Format: Article
Language: English
DOI: 10.1109/TPAMI.2021.3057833
ISSN: 0162-8828
EISSN: 1939-3539
PMID: 33566757
Source: IEEE Electronic Library (IEL)
Subjects:
3D CNN
Algorithms
Cartesian coordinates
Color
Computer networks
Feature extraction
Humans
Image color analysis
Learning
Motion
Neural networks
Neural Networks, Computer
Recognition
representation learning
Representations
Self-supervised learning
Software
Source code
Summaries
Task analysis
Three-dimensional displays
Training
video understanding
Visual fields
Visual observation
Visualization