Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt

Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-training (CLIP) for downstream tasks such as scene text detection. Typically, the text prompt complements the text encoder's input, focusing on global features while neglecting fine-...

Bibliographic details
Published in:	arXiv.org 2024-11
Main authors:	Lin, Xingtao; Qiu, Heqian; Wang, Lanxiao; Wang, Ruihang; Xu, Linfeng; Li, Hongliang
Format:	Article
Language:	eng
Subjects:	Feature maps; Scale models
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Lin, Xingtao; Qiu, Heqian; Wang, Lanxiao; Wang, Ruihang; Xu, Linfeng; Li, Hongliang
description	Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-training (CLIP) for downstream tasks such as scene text detection. Typically, the text prompt complements the text encoder's input, focusing on global features while neglecting fine-grained details, so fine-grained text is ignored in scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, in which the proposed region text prompt helps focus on fine-grained features. The region prompt tuning method decomposes the region text prompt into individual characters and splits the visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows each character to match the local features of a token, thereby avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a sharing position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target "text". To refine the information at the fine-grained level, we implement character-token level interactions before and after encoding. Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that balances global and local features and is fed into DBNet to detect the text. Experiments on benchmarks such as ICDAR2015, TotalText, and CTW1500 demonstrate RPT's impressive performance, underscoring its effectiveness for scene text detection.
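
The character-token matching and score-map fusion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the average pooling into a square grid, the cosine similarity, the blending weight `alpha`, and all function names (`split_into_regions`, `region_scores`, `fuse`) are assumptions made purely for illustration. The bidirectional distance loss and the DBNet detection head are not shown.

```python
import math

def split_into_regions(feature_map, grid):
    """Average-pool an H x W x C feature map (nested lists) into grid*grid region tokens.

    Assumption: H and W are divisible by grid; the pooling choice is hypothetical.
    """
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    rh, rw = h // grid, w // grid
    tokens = []
    for gy in range(grid):
        for gx in range(grid):
            acc = [0.0] * c
            for y in range(gy * rh, (gy + 1) * rh):
                for x in range(gx * rw, (gx + 1) * rw):
                    for k in range(c):
                        acc[k] += feature_map[y][x][k]
            tokens.append([v / (rh * rw) for v in acc])
    return tokens

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def region_scores(char_embeds, tokens, pos_embeds):
    """Score character i against token i (one-to-one correspondence).

    The same position embedding is added to both sides, standing in for the
    sharing position embedding mentioned in the abstract (an assumption).
    """
    scores = []
    for ch, tok, pos in zip(char_embeds, tokens, pos_embeds):
        ch_p = [a + b for a, b in zip(ch, pos)]
        tok_p = [a + b for a, b in zip(tok, pos)]
        scores.append(cosine(ch_p, tok_p))
    return scores

def fuse(general_map, scores, grid, alpha=0.5):
    """Blend an H x W general score map with per-region scores.

    Each region score is broadcast over the pixels of its region; alpha is a
    hypothetical weight, not a value taken from the paper.
    """
    h, w = len(general_map), len(general_map[0])
    rh, rw = h // grid, w // grid
    return [
        [alpha * general_map[y][x]
         + (1 - alpha) * scores[(y // rh) * grid + (x // rw)]
         for x in range(w)]
        for y in range(h)
    ]
```

With a 2 x 2 grid, four character embeddings are paired with four region tokens, and the fused map would then be passed to a detector; a learned fusion would likely replace the fixed `alpha` in practice.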
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3108441702</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3108441702</sourcerecordid><originalsourceid>FETCH-proquest_journals_31084417023</originalsourceid><addsrcrecordid>eNqNjNEKgjAYRkcQJOU7_NC1MLeZ0m0lXUataxH9k4lttk2Inj4rH6Crc3HO981IwDiPo0wwtiChcy2llG1SliQ8IPKMjTIaTtbcew9y0Eo3W8iVxqix5YgaLhVqBIlPD3v0WPnP4OpVp15jDNPD1_9uVmR-KzuH4cQlWecHuTtGvTWPAZ0vWjNYPaqCxzQTIk4p4_9VbzccP-8</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3108441702</pqid></control><display><type>article</type><title>Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt</title><source>Free E- Journals</source><creator>Lin, Xingtao ; Qiu, Heqian ; Wang, Lanxiao ; Wang, Ruihang ; Xu, Linfeng ; Li, Hongliang</creator><creatorcontrib>Lin, Xingtao ; Qiu, Heqian ; Wang, Lanxiao ; Wang, Ruihang ; Xu, Linfeng ; Li, Hongliang</creatorcontrib><description>Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-trained (CLIP) for downstream tasks such as scene text detection. Typically, text prompt complements the text encoder's input, focusing on global features while neglecting fine-grained details, leading to fine-grained text being ignored in task of scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, where region text prompt proposed would help focus on fine-grained features. Region prompt tuning method decomposes region text prompt into individual characters and splits visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows a character matches the local features of a token, thereby avoiding the omission of detailed features and fine-grained text. 
To achieve this, we introduce a sharing position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target ``text''. To refine the information at fine-grained level, we implement character-token level interactions before and after encoding. Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that could balance the global and local features and be fed into DBNet to detect the text. Experiments on benchmarks like ICDAR2015, TotalText, and CTW1500 demonstrate RPT impressive performance, underscoring its effectiveness for scene text detection.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Feature maps ; Scale models</subject><ispartof>arXiv.org, 2024-11</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Lin, Xingtao</creatorcontrib><creatorcontrib>Qiu, Heqian</creatorcontrib><creatorcontrib>Wang, Lanxiao</creatorcontrib><creatorcontrib>Wang, Ruihang</creatorcontrib><creatorcontrib>Xu, Linfeng</creatorcontrib><creatorcontrib>Li, Hongliang</creatorcontrib><title>Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt</title><title>arXiv.org</title><description>Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-trained (CLIP) for downstream tasks such as scene text detection. Typically, text prompt complements the text encoder's input, focusing on global features while neglecting fine-grained details, leading to fine-grained text being ignored in task of scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, where region text prompt proposed would help focus on fine-grained features. Region prompt tuning method decomposes region text prompt into individual characters and splits visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows a character matches the local features of a token, thereby avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a sharing position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target ``text''. 
To refine the information at fine-grained level, we implement character-token level interactions before and after encoding. Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that could balance the global and local features and be fed into DBNet to detect the text. Experiments on benchmarks like ICDAR2015, TotalText, and CTW1500 demonstrate RPT impressive performance, underscoring its effectiveness for scene text detection.</description><subject>Feature maps</subject><subject>Scale models</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNjNEKgjAYRkcQJOU7_NC1MLeZ0m0lXUataxH9k4lttk2Inj4rH6Crc3HO981IwDiPo0wwtiChcy2llG1SliQ8IPKMjTIaTtbcew9y0Eo3W8iVxqix5YgaLhVqBIlPD3v0WPnP4OpVp15jDNPD1_9uVmR-KzuH4cQlWecHuTtGvTWPAZ0vWjNYPaqCxzQTIk4p4_9VbzccP-8</recordid><startdate>20241119</startdate><enddate>20241119</enddate><creator>Lin, Xingtao</creator><creator>Qiu, Heqian</creator><creator>Wang, Lanxiao</creator><creator>Wang, Ruihang</creator><creator>Xu, Linfeng</creator><creator>Li, Hongliang</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20241119</creationdate><title>Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt</title><author>Lin, Xingtao ; Qiu, 
Heqian ; Wang, Lanxiao ; Wang, Ruihang ; Xu, Linfeng ; Li, Hongliang</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31084417023</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Feature maps</topic><topic>Scale models</topic><toplevel>online_resources</toplevel><creatorcontrib>Lin, Xingtao</creatorcontrib><creatorcontrib>Qiu, Heqian</creatorcontrib><creatorcontrib>Wang, Lanxiao</creatorcontrib><creatorcontrib>Wang, Ruihang</creatorcontrib><creatorcontrib>Xu, Linfeng</creatorcontrib><creatorcontrib>Li, Hongliang</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Lin, Xingtao</au><au>Qiu, Heqian</au><au>Wang, Lanxiao</au><au>Wang, Ruihang</au><au>Xu, Linfeng</au><au>Li, Hongliang</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Region Prompt 
Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt</atitle><jtitle>arXiv.org</jtitle><date>2024-11-19</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-trained (CLIP) for downstream tasks such as scene text detection. Typically, text prompt complements the text encoder's input, focusing on global features while neglecting fine-grained details, leading to fine-grained text being ignored in task of scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, where region text prompt proposed would help focus on fine-grained features. Region prompt tuning method decomposes region text prompt into individual characters and splits visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows a character matches the local features of a token, thereby avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a sharing position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target ``text''. To refine the information at fine-grained level, we implement character-token level interactions before and after encoding. Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that could balance the global and local features and be fed into DBNet to detect the text. Experiments on benchmarks like ICDAR2015, TotalText, and CTW1500 demonstrate RPT impressive performance, underscoring its effectiveness for scene text detection.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-11
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_3108441702
source	Free E- Journals
subjects	Feature maps; Scale models
title	Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T17%3A16%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Region%20Prompt%20Tuning:%20Fine-grained%20Scene%20Text%20Detection%20Utilizing%20Region%20Text%20Prompt&rft.jtitle=arXiv.org&rft.au=Lin,%20Xingtao&rft.date=2024-11-19&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3108441702%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3108441702&rft_id=info:pmid/&rfr_iscdi=true