DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection

The prosperity of deep learning contributes to the rapid progress in scene text detection. Among all the methods with convolutional networks, segmentation-based ones have drawn extensive attention due to their superiority in detecting text instances of arbitrary shapes and extreme aspect ratios. However, the bottom-up methods are limited to the performance of their segmentation models. In this paper, we propose DPTNet (Dual-Path Transformer Network), a simple yet effective architecture to model the global and local information for the scene text detection task. We further propose a parallel design that integrates the convolutional network with a powerful self-attention mechanism to provide complementary clues between the attention path and convolutional path. Moreover, a bi-directional interaction module across the two paths is developed to provide complementary clues in the channel and spatial dimensions. We also upgrade the concentration operation by adding an extra multi-head attention layer to it. Our DPTNet achieves state-of-the-art results on the MSRA-TD500 dataset, and provides competitive results on other standard benchmarks in terms of both detection accuracy and speed.
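
The abstract outlines a parallel convolution/self-attention design with a bi-directional interaction module exchanging channel and spatial clues between the two paths. Below is a minimal PyTorch sketch of that idea; the gating choices (a squeeze-and-excitation-style channel gate and a single-map spatial gate), module names, and sizes are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of a dual-path block: a convolutional path and a
# self-attention path run in parallel, with a bi-directional interaction
# that passes a channel clue (attention -> conv) and a spatial clue
# (conv -> attention) between them. Assumed design, not the paper's code.
import torch
import torch.nn as nn


class BidirectionalInteraction(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Channel gate: global pooling + 1x1 convs (SE-style), assumed here.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial gate: collapse channels to a single attention map, assumed here.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, conv_feat, attn_feat):
        conv_out = conv_feat * self.channel_gate(attn_feat)  # channel clue: attn -> conv
        attn_out = attn_feat * self.spatial_gate(conv_feat)  # spatial clue: conv -> attn
        return conv_out, attn_out


class DualPathBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.conv_path = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.interact = BidirectionalInteraction(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        conv_feat = self.conv_path(x)                  # local features

        # Self-attention over flattened spatial positions (global context).
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        attn_feat = self.norm(tokens + attn_out)       # residual + norm
        attn_feat = attn_feat.transpose(1, 2).reshape(b, c, h, w)

        conv_feat, attn_feat = self.interact(conv_feat, attn_feat)
        return conv_feat + attn_feat                   # fuse the two paths


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)                  # toy feature map
    print(DualPathBlock(64)(feat).shape)               # torch.Size([2, 64, 32, 32])
```

How the paths are fused, how many blocks are stacked, and how the attention-augmented concentration operation is built are not specified here; this sketch only fixes the parallel-path-plus-interaction structure described in the abstract.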

Detailed description

Saved in:
Bibliographic details
Main authors: Lin, Jingyu; Jiang, Jie; Yan, Yan; Guo, Chunchao; Wang, Hongfa; Liu, Wei; Wang, Hanzi
Format: Article
Language: eng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
creator Lin, Jingyu; Jiang, Jie; Yan, Yan; Guo, Chunchao; Wang, Hongfa; Liu, Wei; Wang, Hanzi
description The prosperity of deep learning contributes to the rapid progress in scene text detection. Among all the methods with convolutional networks, segmentation-based ones have drawn extensive attention due to their superiority in detecting text instances of arbitrary shapes and extreme aspect ratios. However, the bottom-up methods are limited to the performance of their segmentation models. In this paper, we propose DPTNet (Dual-Path Transformer Network), a simple yet effective architecture to model the global and local information for the scene text detection task. We further propose a parallel design that integrates the convolutional network with a powerful self-attention mechanism to provide complementary clues between the attention path and convolutional path. Moreover, a bi-directional interaction module across the two paths is developed to provide complementary clues in the channel and spatial dimensions. We also upgrade the concentration operation by adding an extra multi-head attention layer to it. Our DPTNet achieves state-of-the-art results on the MSRA-TD500 dataset, and provides competitive results on other standard benchmarks in terms of both detection accuracy and speed.
doi_str_mv 10.48550/arxiv.2208.09878
format Article
creationdate 2022-08-21
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2208.09878
language eng
recordid cdi_arxiv_primary_2208_09878
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection