Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

The scene text removal (STR) task suffers from insufficient training data because pixel-level labeling is expensive. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels...
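To make the idea of text-aware masking from detection labels concrete, here is a minimal sketch. It is not the authors' implementation: the function name `text_aware_masks`, the patch size, the masking ratio, and the box format `(x0, y0, x1, y1)` are all assumptions made for illustration. The sketch only builds the two masks that the record's description implies: a text-region mask derived from bounding boxes, and a randomly sampled mask over text-free patches that a background-modeling objective could try to recover.

```python
import random


def text_aware_masks(height, width, boxes, patch=4, mask_ratio=0.5, seed=0):
    """Build (text_mask, background_mask) as nested lists of booleans.

    Hypothetical helper, not part of the TMIM code base.
    boxes: iterable of (x0, y0, x1, y1) pixel boxes, exclusive at x1 and y1.
    """
    rng = random.Random(seed)
    grid_h, grid_w = height // patch, width // patch

    # Pixel-level text mask: True inside any detection box.
    text_mask = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                text_mask[y][x] = True

    # A patch is "free" when none of its pixels falls inside a text box.
    free = [
        (r, c)
        for r in range(grid_h)
        for c in range(grid_w)
        if not any(
            text_mask[y][x]
            for y in range(r * patch, (r + 1) * patch)
            for x in range(c * patch, (c + 1) * patch)
        )
    ]

    # Hide a random fraction of the text-free patches (assumed strategy).
    hidden = rng.sample(free, int(round(mask_ratio * len(free))))
    background_mask = [[False] * width for _ in range(height)]
    for r, c in hidden:
        for y in range(r * patch, (r + 1) * patch):
            for x in range(c * patch, (c + 1) * patch):
                background_mask[y][x] = True

    return text_mask, background_mask
```

In a real pipeline these masks would typically be tensors applied to image patches; plain lists are used here only to keep the sketch free of dependencies.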

Bibliographic Details
Main authors: Wang, Zixiao; Xie, Hongtao; Wang, YuXin; Qu, Yadong; Guo, Fengjun; Liu, Pengwei
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Wang, Zixiao
Xie, Hongtao
Wang, YuXin
Qu, Yadong
Guo, Fengjun
Liu, Pengwei
description The scene text removal (STR) task suffers from insufficient training data because pixel-level labeling is expensive. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding boxes). Unlike previous pretraining methods, which use indirect auxiliary tasks only to enhance implicit feature extraction, TMIM enables, for the first time, the STR task to be trained directly in a weakly supervised manner, which explores STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model can achieve impressive performance using only public text detection datasets, which greatly alleviates the limitation of high-cost STR labels. Experiments demonstrate that our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.
doi_str_mv 10.48550/arxiv.2409.13431
format Article
creationdate 2024-09-20
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
linktorsrc https://arxiv.org/abs/2409.13431
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2409.13431
ispartof
issn
language eng
recordid cdi_arxiv_primary_2409_13431
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T21%3A34%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Leveraging%20Text%20Localization%20for%20Scene%20Text%20Removal%20via%20Text-aware%20Masked%20Image%20Modeling&rft.au=Wang,%20Zixiao&rft.date=2024-09-20&rft_id=info:doi/10.48550/arxiv.2409.13431&rft_dat=%3Carxiv_GOX%3E2409_13431%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true