Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing
Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability in addressing various downstream tasks (choose what you really need). Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on this dataset, we decouple the two types of features through the supervision design. Concretely, we directly split the visual representation into style and content features; the content features are supervised by a text recognition loss, while an alignment loss aligns the style features across the image pairs. Then, the style features are employed to reconstruct the counterpart image via an image decoder, with a prompt that indicates the counterpart's content. Such an operation effectively decouples the features based on their distinctive properties. To the best of our knowledge, this is the first work in the field of scene text to disentangle the inherent properties of text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing.
Published in: arXiv.org, 2024-05
Main authors: Zhang, Boqiang; Xie, Hongtao; Gao, Zuan; Wang, Yuxin
Format: Article
Language: English
Subjects: Datasets; Editing; Image reconstruction; Learning; Representations
Online access: Full text
EISSN: 2331-8422
Source: Free E-Journals
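The supervision design summarized in the abstract above lends itself to a compact illustration. The following is a minimal, hypothetical PyTorch-style sketch of the three kinds of supervision implied there: a recognition loss on content features, an alignment loss on the style features of a same-style image pair, and cross reconstruction of the counterpart image from style features plus a content prompt. All module names, network shapes, loss forms, and weights are assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical sketch of the disentanglement supervision described in the abstract.
# Module names, shapes, and loss weights are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledEncoder(nn.Module):
    """Encodes an image and splits the representation into content/style halves."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(  # toy CNN backbone (assumption)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 2 * feat_dim),
        )
        self.feat_dim = feat_dim

    def forward(self, img):
        feat = self.backbone(img)
        # First half -> content features, second half -> style features.
        return feat[:, :self.feat_dim], feat[:, self.feat_dim:]

class Recognizer(nn.Module):
    """Predicts per-position character logits from content features (assumed head)."""
    def __init__(self, feat_dim=256, max_len=25, num_classes=37):
        super().__init__()
        self.head = nn.Linear(feat_dim, max_len * num_classes)
        self.max_len, self.num_classes = max_len, num_classes

    def forward(self, content):
        return self.head(content).view(-1, self.max_len, self.num_classes)

class PromptedDecoder(nn.Module):
    """Reconstructs an image from style features plus a content prompt (assumed design)."""
    def __init__(self, feat_dim=256, num_classes=37, img_size=32):
        super().__init__()
        self.embed = nn.Embedding(num_classes, feat_dim)
        self.decode = nn.Linear(2 * feat_dim, 3 * img_size * img_size)
        self.img_size = img_size

    def forward(self, style, prompt_text):
        prompt = self.embed(prompt_text).mean(dim=1)       # pool the content prompt
        pixels = self.decode(torch.cat([style, prompt], dim=1))
        return pixels.view(-1, 3, self.img_size, self.img_size)

def darling_style_losses(enc, rec, dec, img_a, img_b, text_a, text_b):
    """Loss terms implied by the abstract: recognition on content, alignment of style
    across the same-style pair, and cross reconstruction of the counterpart image."""
    content_a, style_a = enc(img_a)
    content_b, style_b = enc(img_b)
    # 1) Content features are supervised by a text recognition loss.
    rec_loss = F.cross_entropy(rec(content_a).flatten(0, 1), text_a.flatten()) \
             + F.cross_entropy(rec(content_b).flatten(0, 1), text_b.flatten())
    # 2) Style features of the pair (same style, different content) are aligned.
    align_loss = F.mse_loss(style_a, style_b)
    # 3) Style of image A plus the content prompt of image B reconstructs image B, and vice versa.
    recon_loss = F.l1_loss(dec(style_a, text_b), img_b) + F.l1_loss(dec(style_b, text_a), img_a)
    return rec_loss + align_loss + recon_loss

# Toy usage with random tensors (shapes are assumptions).
enc, rec, dec = DisentangledEncoder(), Recognizer(), PromptedDecoder()
img_a, img_b = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
text_a, text_b = torch.randint(0, 37, (4, 25)), torch.randint(0, 37, (4, 25))
loss = darling_style_losses(enc, rec, dec, img_a, img_b, text_a, text_b)
loss.backward()
```

The real encoder, recognizer, and prompted decoder would of course be far larger; the point of the sketch is only how the paired data routes supervision: content features see only the recognition loss, while style features are pulled together across the pair and must carry enough style information to redraw the counterpart image.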