CLIPTER: Looking at the Bigger Picture in Scene Text Recognition

Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text. However, current scene text recognizers are unaware of the bigger picture as they operate on cropped text images. In this study, we harness the representative capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer. We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a gated cross-attention mechanism. This component gradually shifts to the context-enhanced representation, allowing for stable fine-tuning of a pretrained recognizer. We demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP TExt Recognition), on leading text recognition architectures and achieve state-of-the-art results across multiple benchmarks. Furthermore, our analysis highlights improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.
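The gated cross-attention fusion described in the abstract can be sketched as follows. This is a minimal, single-head NumPy illustration, not the authors' implementation: the function names, the single gate scalar, and the zero-initialized `tanh` gate (so the module starts as the identity and gradually shifts toward the context-enhanced representation) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(word_feats, scene_feats, Wq, Wk, Wv, gate):
    """Fuse scene-level context into word-level recognizer features.

    word_feats:  (T, d) word-level features from the crop-based recognizer (queries)
    scene_feats: (S, d) scene-level representation of the full image,
                 e.g. from a vision-language model such as CLIP (keys/values)
    gate:        scalar; tanh(gate) scales the attended context. With gate
                 initialized to 0 the module is the identity, so fine-tuning
                 of a pretrained recognizer starts from its original behavior.
    """
    q = word_feats @ Wq
    k = scene_feats @ Wk
    v = scene_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, S) attention weights
    context = attn @ v                              # (T, d) scene context per word feature
    return word_feats + np.tanh(gate) * context
```

At `gate = 0` the output equals the input exactly, which is what makes the fine-tuning start stable; as the gate is learned, the representation shifts toward the context-enhanced one.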

Detailed Description

Bibliographic Details
Published in: arXiv.org, 2023-07-23
Publisher: Cornell University Library, arXiv.org (Ithaca)
Main authors: Aberdam, Aviad; Bensaïd, David; Golts, Alona; Ganz, Roy; Nuriel, Oren; Tichauer, Royee; Mazor, Shai; Litman, Ron
Format: Article
Language: English
Online access: Full text
Identifier: EISSN 2331-8422
Source: Freely Accessible Journals
Subjects: Feature recognition; Representations