Unsupervised Temporal Correspondence Learning for Unified Video Object Removal

Video object removal aims at erasing a target object throughout an entire video and filling the resulting holes with plausible content, given an object mask in the first frame as input. Existing solutions mostly break the task down into (supervised) mask tracking and (self-supervised) video completion, and then tackle the two separately with tailored designs. In this paper, we introduce a new setup, coined unified video object removal, where mask tracking and completion are addressed within a unified framework. Despite introducing more challenges, the setup is promising for future practical usage. We embrace the observation that these two sub-tasks have strong inherent connections in terms of pixel-level temporal correspondence, and making full use of these connections is beneficial given the complexity of both the algorithm and its deployment. We propose a single network linking the two sub-tasks by inferring temporal correspondences across multiple frames, i.e., correspondences between valid-valid (V-V) pixel pairs for mask tracking and correspondences between valid-hole (V-H) pixel pairs for video completion. Thanks to the unified setup, the network can be learned end-to-end in a totally unsupervised fashion, without any annotations. We demonstrate that our method generates visually pleasing results and performs favorably against existing separate solutions in realistic test cases.
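The abstract describes the mechanism only in prose, so here is a minimal sketch of how a single soft correspondence map can drive both sub-tasks. This is an illustrative assumption, not the paper's implementation: the function name, the single-reference-frame setup, the cosine-similarity affinity with temperature `tau`, and the simplification that the reference frame contains no holes are all choices made for this example.

```python
# Minimal PyTorch sketch: one correspondence (attention) map between a
# reference frame and a target frame, read once for mask tracking (V-V)
# and once for hole filling (V-H). Not the authors' code.
import torch
import torch.nn.functional as F

def propagate_via_correspondence(feat_ref, feat_tgt, mask_ref, rgb_ref,
                                 hole_tgt, tau=0.07):
    """Propagate a reference-frame mask (V-V) and colors (V-H) to a target frame.

    All inputs are (channels, H, W). The reference frame is assumed fully
    valid; `hole_tgt` marks the target pixels to be filled (1 = hole).
    """
    C, H, W = feat_ref.shape
    # L2-normalize features so dot products become cosine similarities.
    fr = F.normalize(feat_ref.reshape(C, -1), dim=0)   # (C, HW)
    ft = F.normalize(feat_tgt.reshape(C, -1), dim=0)   # (C, HW)

    # Soft correspondence: each target pixel attends over all reference pixels.
    attn = (ft.t() @ fr / tau).softmax(dim=1)          # (HW_tgt, HW_ref)

    # V-V correspondences: carry the object mask forward (mask tracking).
    mask_tgt = (attn @ mask_ref.reshape(1, -1).t()).t().reshape(1, H, W)

    # V-H correspondences: pull valid reference colors into the target's
    # holes (video completion); valid target pixels keep their own content.
    fill = (attn @ rgb_ref.reshape(3, -1).t()).t().reshape(3, H, W)
    fill = fill * hole_tgt

    return mask_tgt, fill

# Hypothetical usage with random tensors standing in for encoder features:
feat_ref, feat_tgt = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
mask_ref = (torch.rand(1, 32, 32) > 0.8).float()
rgb_ref = torch.rand(3, 32, 32)
hole_tgt = (torch.rand(1, 32, 32) > 0.9).float()
mask_tgt, fill = propagate_via_correspondence(feat_ref, feat_tgt,
                                              mask_ref, rgb_ref, hole_tgt)
```

The sketch highlights the design choice the abstract emphasizes: the same attention matrix is read twice, once against labels (V-V pairs, tracking) and once against content (V-H pairs, completion), so learning one correspondence model serves both sub-tasks.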

Bibliographic Details
Published in: IEEE Transactions on Image Processing, 2023-12, Vol. PP, p. 1-1
Main Authors: Wang, Zhongdao; Wang, Jinglu; Li, Xiao; Li, Ya-Li; Lu, Yan; Wang, Shengjin
Format: Article
Language: English
Subjects: Video object removal; video inpainting; unsupervised tracking; correspondence learning; Annotations; Image reconstruction; Optical imaging; Target tracking; Task analysis; Training; Visualization
Online Access: Order full text
DOI: 10.1109/TIP.2023.3340605
PMID: 38090851
ISSN: 1057-7149
EISSN: 1941-0042
Source: IEEE Electronic Library (IEL)