Unsupervised Temporal Correspondence Learning for Unified Video Object Removal
Video object removal aims at erasing a target object in the entire video and filling the holes with plausible content, given an object mask in the first frame as input. Existing solutions mostly break down the task into (supervised) mask tracking and (self-supervised) video completion, and then tackle them separately with tailored designs. In this paper, we introduce a new setup, coined unified video object removal, where mask tracking and completion are addressed within a unified framework. Despite introducing more challenges, the setup is promising for future practical usage. We embrace the observation that these two sub-tasks have strong inherent connections in terms of pixel-level temporal correspondence. Making full use of these connections is beneficial given the complexity of both the algorithm and its deployment. We propose a single network linking the two sub-tasks by inferring temporal correspondences across multiple frames, i.e., correspondences between valid-valid (V-V) pixel pairs for mask tracking and between valid-hole (V-H) pixel pairs for video completion. Thanks to the unified setup, the network can be learned end-to-end in a totally unsupervised fashion, without any annotations. We demonstrate that our method generates visually pleasing results and performs favorably against existing separate solutions in realistic test cases.
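To make the abstract's central idea concrete, the sketch below shows, in plain NumPy, how a single pixel-level affinity can serve both sub-tasks: propagating a mask through valid-valid correspondences and filling holes through valid-hole ones. It is a minimal illustration under stated assumptions, not the authors' network: the per-pixel feature maps are assumed to come from any encoder, and the function names, the temperature value, and the binary valid/hole masks are all hypothetical.

```python
import numpy as np

def frame_affinity(feat_ref, feat_tgt, temperature=0.07):
    """Row-stochastic affinity from target pixels to reference pixels,
    built from cosine similarity of per-pixel features of shape (C, H, W)."""
    C = feat_ref.shape[0]
    f_ref = feat_ref.reshape(C, -1)                       # (C, HW)
    f_tgt = feat_tgt.reshape(C, -1)                       # (C, HW)
    f_ref = f_ref / (np.linalg.norm(f_ref, axis=0, keepdims=True) + 1e-8)
    f_tgt = f_tgt / (np.linalg.norm(f_tgt, axis=0, keepdims=True) + 1e-8)
    sim = (f_tgt.T @ f_ref) / temperature                 # (HW_tgt, HW_ref)
    sim -= sim.max(axis=1, keepdims=True)                 # numerical stability
    aff = np.exp(sim)
    return aff / aff.sum(axis=1, keepdims=True)

def propagate_mask(aff, mask_ref):
    """V-V use: track the object mask by carrying reference labels
    to the target frame through the affinity."""
    H, W = mask_ref.shape
    return (aff @ mask_ref.reshape(-1)).reshape(H, W)

def fill_holes(aff, frame_ref, valid_ref, frame_tgt, hole_tgt):
    """V-H use: replace hole pixels in the target frame with an
    affinity-weighted blend of *valid* reference pixels only."""
    C, H, W = frame_ref.shape
    a = aff * valid_ref.reshape(1, -1)                    # zero out hole refs
    a = a / (a.sum(axis=1, keepdims=True) + 1e-8)         # renormalize rows
    blended = (a @ frame_ref.reshape(C, -1).T).T.reshape(C, H, W)
    hole = hole_tgt.reshape(1, H, W)
    return frame_tgt * (1 - hole) + blended * hole
```

In the paper's terminology, `propagate_mask` exercises the V-V correspondences and `fill_holes` the V-H ones; the actual method learns the shared features end-to-end without annotations, which this sketch does not attempt.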
Saved in:
Published in: | IEEE Transactions on Image Processing, 2023-12, Vol. PP, p. 1-1 |
---|---|
Main Authors: | Wang, Zhongdao; Wang, Jinglu; Li, Xiao; Li, Ya-Li; Lu, Yan; Wang, Shengjin |
Format: | Article |
Language: | English |
Subjects: | Annotations; correspondence learning; Image reconstruction; Optical imaging; Target tracking; Task analysis; Training; unsupervised tracking; video inpainting; Video object removal; Visualization |
Online Access: | Order full text |
container_end_page | 1
container_issue |
container_start_page | 1
container_title | IEEE transactions on image processing
container_volume | PP
creator | Wang, Zhongdao; Wang, Jinglu; Li, Xiao; Li, Ya-Li; Lu, Yan; Wang, Shengjin
description | Video object removal aims at erasing a target object in the entire video and filling holes with plausible contents, given an object mask in the first frame as input. Existing solutions mostly break down the task into (supervised) mask tracking and (self-supervised) video completion, and then separately tackle them with tailored designs. In this paper, we introduce a new setup, coined as unified video object removal, where mask tracking and completion are addressed within a unified framework. Despite introducing more challenges, the setup is promising for future practical usage. We embrace the observation that these two sub-tasks have strong inherent connections in terms of pixel-level temporal correspondence. Making full use of the connections could be beneficial considering the complexity of both algorithm and deployment. We propose a single network linking the two sub-tasks by inferring temporal correspondences across multiple frames, i.e., correspondences between valid-valid (V-V) pixel pairs for mask tracking and correspondences between valid-hole (V-H) pixel pairs for video completion. Thanks to the unified setup, the network can be learned end-to-end in a totally unsupervised fashion without any annotations. We demonstrate that our method can generate visually pleasing results and perform favorably against existing separate solutions in realistic test cases.
doi_str_mv | 10.1109/TIP.2023.3340605 |
format | Article |
fulltext | fulltext_linktorsrc
identifier | ISSN: 1057-7149; EISSN: 1941-0042; DOI: 10.1109/TIP.2023.3340605; PMID: 38090851
ispartof | IEEE transactions on image processing, 2023-12, Vol.PP, p.1-1 |
issn | 1057-7149; 1941-0042
language | eng |
recordid | cdi_proquest_miscellaneous_2902973461 |
source | IEEE Electronic Library (IEL) |
subjects | Annotations; correspondence learning; Image reconstruction; Optical imaging; Target tracking; Task analysis; Training; unsupervised tracking; video inpainting; Video object removal; Visualization
title | Unsupervised Temporal Correspondence Learning for Unified Video Object Removal |