Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

The scene text removal (STR) task suffers from insufficient training data due to expensive pixel-level labeling. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels...

Bibliographic Details
Main authors:	Wang, Zixiao; Xie, Hongtao; Wang, YuXin; Qu, Yadong; Guo, Fengjun; Liu, Pengwei
Format:	Article
Language:	eng
Subjects:	Computer Science - Computer Vision and Pattern Recognition

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Wang, Zixiao ; Xie, Hongtao ; Wang, YuXin ; Qu, Yadong ; Guo, Fengjun ; Liu, Pengwei
description	The scene text removal (STR) task suffers from insufficient training data due to expensive pixel-level labeling. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding boxes). Unlike previous pretraining methods, which use indirect auxiliary tasks only to enhance implicit feature extraction, TMIM for the first time enables the STR task to be trained directly in a weakly supervised manner, exploring STR knowledge explicitly and efficiently. In TMIM, a Background Modeling stream is first built to learn background generation rules by recovering the masked non-text region; it also provides pseudo STR labels for the masked text region. Second, a Text Erasing stream learns from the pseudo labels and equips the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model achieves impressive performance using only public text detection datasets, which greatly reduces the dependence on high-cost STR labels. Experiments demonstrate that our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.
doi_str_mv	10.48550/arxiv.2409.13431
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2409_13431</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2409_13431</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2409_134313</originalsourceid><addsrcrecordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMjGw1DM0NjE25GSI8EktSy1KTM_MS1cISa0oUfDJT07MyaxKLMnMz1NIyy9SCE5OzUuFyAWl5uaXJeYolGUmggV0E8sTi1IVfBOLs1NTFDxzE9OBnPyU1BygaTwMrGmJOcWpvFCam0HezTXE2UMX7Ib4gqLM3MSiyniQW-LBbjEmrAIA8g8_CA</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling</title><source>arXiv.org</source><creator>Wang, Zixiao ; Xie, Hongtao ; Wang, YuXin ; Qu, Yadong ; Guo, Fengjun ; Liu, Pengwei</creator><creatorcontrib>Wang, Zixiao ; Xie, Hongtao ; Wang, YuXin ; Qu, Yadong ; Guo, Fengjun ; Liu, Pengwei</creatorcontrib><description>Existing scene text removal (STR) task suffers from insufficient training data due to the expensive pixel-level labeling. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding box). Different from previous pretraining methods that use indirect auxiliary tasks only to enhance the implicit feature extraction ability, our TMIM first enables the STR task to be directly trained in a weakly supervised manner, which explores the STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip the model with end-to-end STR ability. 
Benefiting from the two collaborative streams, our STR model can achieve impressive performance only with the public text detection datasets, which greatly alleviates the limitation of the high-cost STR labels. Experiments demonstrate that our method outperforms other pretrain methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.</description><identifier>DOI: 10.48550/arxiv.2409.13431</identifier><language>eng</language><subject>Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2024-09</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2409.13431$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2409.13431$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Wang, Zixiao</creatorcontrib><creatorcontrib>Xie, Hongtao</creatorcontrib><creatorcontrib>Wang, YuXin</creatorcontrib><creatorcontrib>Qu, Yadong</creatorcontrib><creatorcontrib>Guo, Fengjun</creatorcontrib><creatorcontrib>Liu, Pengwei</creatorcontrib><title>Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling</title><description>Existing scene text removal (STR) task suffers from insufficient training data due to the expensive pixel-level labeling. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding box). 
Different from previous pretraining methods that use indirect auxiliary tasks only to enhance the implicit feature extraction ability, our TMIM first enables the STR task to be directly trained in a weakly supervised manner, which explores the STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model can achieve impressive performance only with the public text detection datasets, which greatly alleviates the limitation of the high-cost STR labels. Experiments demonstrate that our method outperforms other pretrain methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.</description><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMjGw1DM0NjE25GSI8EktSy1KTM_MS1cISa0oUfDJT07MyaxKLMnMz1NIyy9SCE5OzUuFyAWl5uaXJeYolGUmggV0E8sTi1IVfBOLs1NTFDxzE9OBnPyU1BygaTwMrGmJOcWpvFCam0HezTXE2UMX7Ib4gqLM3MSiyniQW-LBbjEmrAIA8g8_CA</recordid><startdate>20240920</startdate><enddate>20240920</enddate><creator>Wang, Zixiao</creator><creator>Xie, Hongtao</creator><creator>Wang, YuXin</creator><creator>Qu, Yadong</creator><creator>Guo, Fengjun</creator><creator>Liu, Pengwei</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240920</creationdate><title>Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling</title><author>Wang, Zixiao ; Xie, Hongtao ; Wang, YuXin ; Qu, Yadong ; Guo, Fengjun ; 
Liu, Pengwei</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2409_134313</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Wang, Zixiao</creatorcontrib><creatorcontrib>Xie, Hongtao</creatorcontrib><creatorcontrib>Wang, YuXin</creatorcontrib><creatorcontrib>Qu, Yadong</creatorcontrib><creatorcontrib>Guo, Fengjun</creatorcontrib><creatorcontrib>Liu, Pengwei</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Wang, Zixiao</au><au>Xie, Hongtao</au><au>Wang, YuXin</au><au>Qu, Yadong</au><au>Guo, Fengjun</au><au>Liu, Pengwei</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling</atitle><date>2024-09-20</date><risdate>2024</risdate><abstract>Existing scene text removal (STR) task suffers from insufficient training data due to the expensive pixel-level labeling. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding box). Different from previous pretraining methods that use indirect auxiliary tasks only to enhance the implicit feature extraction ability, our TMIM first enables the STR task to be directly trained in a weakly supervised manner, which explores the STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. 
Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model can achieve impressive performance only with the public text detection datasets, which greatly alleviates the limitation of the high-cost STR labels. Experiments demonstrate that our method outperforms other pretrain methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.</abstract><doi>10.48550/arxiv.2409.13431</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2409.13431
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2409_13431
source	arXiv.org
subjects	Computer Science - Computer Vision and Pattern Recognition
title	Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T21%3A34%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Leveraging%20Text%20Localization%20for%20Scene%20Text%20Removal%20via%20Text-aware%20Masked%20Image%20Modeling&rft.au=Wang,%20Zixiao&rft.date=2024-09-20&rft_id=info:doi/10.48550/arxiv.2409.13431&rft_dat=%3Carxiv_GOX%3E2409_13431%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true