Real-time Lexicon-free Scene Text Retrieval


Bibliographic Details
Published in: Pattern Recognition, 2021-02, Vol. 110, p. 107656, Article 107656
Main Authors: Mafla, Andrés; Tito, Rubèn; Dey, Sounak; Gómez, Lluís; Rusiñol, Marçal; Valveny, Ernest; Karatzas, Dimosthenis
Format: Article
Language: English
Online Access: Full text
Description
Summary (highlights):
• Improved method that achieves state-of-the-art performance in image-based text retrieval.
• Results on different CNN backbones modified to predict a PHOC of detected textual instances.
• Effect of different PHOC dimensions is explored and analyzed.
• The PHOC embedding allows retrieving out-of-vocabulary words unseen at training time.
• The proposed method achieves state of the art on a multilingual dataset of samples unseen at training time.
• The method is faster than the state of the art, allowing real-time retrieval in videos.
In this work, we address the task of scene text retrieval: given a text query, the system returns all images containing the queried text. The proposed model uses a single-shot CNN architecture that predicts bounding boxes and builds a compact representation of spotted words. In this way, the problem can be modeled as a nearest-neighbor search of the textual representation of a query over the CNN outputs collected from the entire image database. Our experiments demonstrate that the proposed model outperforms the previous state of the art, while offering a significant increase in processing speed and unmatched expressiveness on samples never seen at training time. Several experiments assessing the generalization capability of the model are conducted on a multilingual dataset, along with an application to real-time text spotting in videos.
ISSN: 0031-3203; 1873-5142
DOI: 10.1016/j.patcog.2020.107656
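The retrieval scheme the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a simplified PHOC (Pyramidal Histogram of Characters) built directly from strings, whereas the paper's CNN *predicts* PHOC vectors for detected text regions; the alphabet, pyramid levels, and cosine-similarity ranking here are illustrative assumptions. The key idea survives the simplification: both query and detections live in the same PHOC space, so retrieval reduces to a nearest-neighbor search.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # assumed character set
LEVELS = (2, 3, 4, 5)  # assumed pyramid levels; the paper studies other PHOC dimensions

def phoc(word):
    """Simplified PHOC: at each pyramid level, split the word into equal
    regions and mark which alphabet characters occur in each region."""
    word = word.lower()
    vec = []
    for level in LEVELS:
        for region in range(level):
            bits = [0] * len(ALPHABET)
            for i, ch in enumerate(word):
                # assign a character to the region containing its midpoint
                mid = (i + 0.5) / len(word)
                if region / level <= mid < (region + 1) / level and ch in ALPHABET:
                    bits[ALPHABET.index(ch)] = 1
            vec.extend(bits)
    return np.array(vec, dtype=np.float32)

def retrieve(query, db):
    """Rank database entries (label, phoc_vector) by cosine similarity
    to the query's PHOC; this is the nearest-neighbor search step."""
    q = phoc(query)
    scores = []
    for label, v in db:
        sim = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))
        scores.append((sim, label))
    return sorted(scores, reverse=True)

# In the paper, these vectors would come from the CNN's predictions on images;
# here we embed the ground-truth words directly for illustration.
db = [(w, phoc(w)) for w in ["coffee", "coffe", "pizza", "hotel"]]
print(retrieve("coffee", db)[0][1])  # prints "coffee"
```

Because PHOC is built compositionally from characters, a query word never seen at training time still maps to a valid vector, which is what enables the out-of-vocabulary retrieval highlighted above.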