Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling
The scene text removal (STR) task suffers from insufficient training data because pixel-level labeling is expensive. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels...
container_end_page | |
---|---|
creator | Wang, Zixiao ; Xie, Hongtao ; Wang, YuXin ; Qu, Yadong ; Guo, Fengjun ; Liu, Pengwei |
description | The scene text removal (STR) task suffers from insufficient training
data because pixel-level labeling is expensive. In this paper, we aim to
address this issue by introducing a Text-aware Masked Image Modeling algorithm
(TMIM), which can pretrain STR models with low-cost text detection labels
(e.g., text bounding boxes). Unlike previous pretraining methods, which use
indirect auxiliary tasks only to enhance implicit feature extraction,
TMIM is the first to enable the STR task to be trained directly in a weakly
supervised manner, exploring STR knowledge explicitly and efficiently.
TMIM has two streams. First, a Background Modeling stream learns background
generation rules by recovering masked non-text regions; it also
provides pseudo STR labels for masked text regions. Second, a Text Erasing
stream learns from these pseudo labels and equips the model with
end-to-end STR ability. With the two collaborative streams, our STR
model achieves strong performance using only public text detection
datasets, which greatly reduces the dependence on high-cost STR labels.
Experiments demonstrate that our method outperforms other pretraining methods and
achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will
be available at https://github.com/wzx99/TMIM. |
doi_str_mv | 10.48550/arxiv.2409.13431 |
format | Article |
creationdate | 2024-09-20 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
oa | free_for_read |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2409.13431 |
language | eng |
recordid | cdi_arxiv_primary_2409_13431 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T21%3A34%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Leveraging%20Text%20Localization%20for%20Scene%20Text%20Removal%20via%20Text-aware%20Masked%20Image%20Modeling&rft.au=Wang,%20Zixiao&rft.date=2024-09-20&rft_id=info:doi/10.48550/arxiv.2409.13431&rft_dat=%3Carxiv_GOX%3E2409_13431%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
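
The description above mentions two concrete ingredients: text bounding boxes as the low-cost label, and PSNR as the reported metric. The following is a minimal sketch of both, not the authors' implementation. The function names `boxes_to_mask` and `psnr`, the box format `(x0, y0, x1, y1)` with exclusive upper bounds, and the use of plain nested lists for grayscale images are all assumptions made for illustration.

```python
import math


def boxes_to_mask(height, width, boxes):
    """Return a height x width grid with 1 inside any text box, else 0.

    Hypothetical helper: each box is (x0, y0, x1, y1) in pixels,
    with x1 and y1 exclusive. Boxes are clipped to the image.
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    PSNR = 10 * log10(max_value^2 / MSE); identical images give infinity.
    """
    squared_error = 0.0
    count = 0
    for ref_row, out_row in zip(reference, output):
        for r, o in zip(ref_row, out_row):
            squared_error += (r - o) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Usage: one text box on a 4 x 6 image splits pixels into text / non-text.
text_mask = boxes_to_mask(4, 6, [(1, 1, 4, 3)])
non_text_mask = [[1 - v for v in row] for row in text_mask]
```

A mask of this kind is only the weak label; how the two streams sample masked regions inside and outside the boxes is described in the paper itself and is not reproduced here.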