Unsupervised Temporal Correspondence Learning for Unified Video Object Removal

Video object removal aims at erasing a target object throughout an entire video and filling the resulting holes with plausible content, given an object mask in the first frame as input. Existing solutions mostly break the task down into (supervised) mask tracking and (self-supervised) video completion, and then tackle the two separately with tailored designs. In this paper, we introduce a new setup, coined unified video object removal, where mask tracking and completion are addressed within a unified framework. Despite introducing more challenges, the setup is promising for future practical usage. We embrace the observation that these two sub-tasks have strong inherent connections in terms of pixel-level temporal correspondence, and making full use of these connections is beneficial given the complexity of both the algorithm and its deployment. We propose a single network linking the two sub-tasks by inferring temporal correspondences across multiple frames, i.e., correspondences between valid-valid (V-V) pixel pairs for mask tracking and correspondences between valid-hole (V-H) pixel pairs for video completion. Thanks to the unified setup, the network can be learned end-to-end in a totally unsupervised fashion, without any annotations. We demonstrate that our method generates visually pleasing results and performs favorably against existing separate solutions in realistic test cases.
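The abstract describes the mechanism only in prose, so here is a minimal sketch of how a single soft correspondence map can drive both sub-tasks. This is an illustrative assumption, not the paper's implementation: the function name, the single-reference-frame setup, the cosine-similarity affinity with temperature `tau`, and the simplification that the reference frame contains no holes are all choices made for this example.

```python
# Minimal PyTorch sketch: one correspondence (attention) map between a
# reference frame and a target frame, read once for mask tracking (V-V)
# and once for hole filling (V-H). Not the authors' code.
import torch
import torch.nn.functional as F

def propagate_via_correspondence(feat_ref, feat_tgt, mask_ref, rgb_ref,
                                 hole_tgt, tau=0.07):
    """Propagate a reference-frame mask (V-V) and colors (V-H) to a target frame.

    All inputs are (channels, H, W). The reference frame is assumed fully
    valid; `hole_tgt` marks the target pixels to be filled (1 = hole).
    """
    C, H, W = feat_ref.shape
    # L2-normalize features so dot products become cosine similarities.
    fr = F.normalize(feat_ref.reshape(C, -1), dim=0)   # (C, HW)
    ft = F.normalize(feat_tgt.reshape(C, -1), dim=0)   # (C, HW)

    # Soft correspondence: each target pixel attends over all reference pixels.
    attn = (ft.t() @ fr / tau).softmax(dim=1)          # (HW_tgt, HW_ref)

    # V-V correspondences: carry the object mask forward (mask tracking).
    mask_tgt = (attn @ mask_ref.reshape(1, -1).t()).t().reshape(1, H, W)

    # V-H correspondences: pull valid reference colors into the target's
    # holes (video completion); valid target pixels keep their own content.
    fill = (attn @ rgb_ref.reshape(3, -1).t()).t().reshape(3, H, W)
    fill = fill * hole_tgt

    return mask_tgt, fill

# Hypothetical usage with random tensors standing in for encoder features:
feat_ref, feat_tgt = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
mask_ref = (torch.rand(1, 32, 32) > 0.8).float()
rgb_ref = torch.rand(3, 32, 32)
hole_tgt = (torch.rand(1, 32, 32) > 0.9).float()
mask_tgt, fill = propagate_via_correspondence(feat_ref, feat_tgt,
                                              mask_ref, rgb_ref, hole_tgt)
```

The sketch highlights the design choice the abstract emphasizes: the same attention matrix is read twice, once against labels (V-V pairs, tracking) and once against content (V-H pairs, completion), so learning one correspondence model serves both sub-tasks.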

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2023-12, Vol. PP, p. 1-1
Main Authors: Wang, Zhongdao; Wang, Jinglu; Li, Xiao; Li, Ya-Li; Lu, Yan; Wang, Shengjin
Format: Article
Language: English
Subjects: Video object removal; video inpainting; unsupervised tracking; correspondence learning; Annotations; Image reconstruction; Optical imaging; Target tracking; Task analysis; Training; Visualization
Online Access: Order full text
DOI: 10.1109/TIP.2023.3340605
PMID: 38090851
ISSN: 1057-7149
EISSN: 1941-0042
Source: IEEE Electronic Library (IEL)