Towards reduced-complexity scene text recognition (RCSTR) through a novel salient feature selection

Bibliographic details
Published in:	International journal on document analysis and recognition 2024, Vol.27 (3), p.289-302
Main authors:	Buoy, Rina; Iwamura, Masakazu; Srun, Sovila; Kise, Koichi
Format:	Article
Language:	English
Subjects:	Character recognition; Complexity; Computer Science; Decoders; Feature maps; Feature recognition; Feature selection; Image Processing and Computer Vision; Pattern Recognition; Special Issue Paper

Description
Abstract:	The integration of an attention mechanism has played a crucial role in many recent scene text recognition (STR) methods. It enables the capture of spatial feature dependencies (self-attention) and the identification of relevant features while predicting a character (cross-attention). However, computation and memory requirements in the self-attention and cross-attention layers increase quadratically and linearly with the feature map size, respectively, leading to a computational bottleneck in low-resource environments. But is it necessary to attend to the entire feature map? Text in a natural scene is continuous and oriented in a specific direction, and it does not occupy the entire image. Therefore, utilizing only a small salient subset of features in text regions is sufficient for accurately predicting characters. Based on this salient feature selection, we propose a reduced-complexity scene text recognition framework that significantly reduces model complexity and memory requirements in the self-attention and cross-attention layers. We validate the proposed framework by employing a convolutional STR architecture with both connectionist temporal classification and transformer decoders. Through model complexity and performance analyses on public benchmark datasets, we demonstrate that the proposed method can substantially reduce model complexity while still maintaining reasonably robust recognition accuracy.
ISSN:	1433-2833 1433-2825
DOI:	10.1007/s10032-024-00474-x
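
The scaling argument in the abstract can be illustrated with a small sketch. It is not the selection method of the paper, which this record does not describe: the top-k rule, the saliency scores, the function names `select_salient` and `attention_pairs`, and the sizes used (an 8 x 32 feature map, 25 output characters, 64 kept positions) are all assumptions chosen for illustration. The sketch only shows why keeping fewer feature positions cuts self-attention cost quadratically and cross-attention cost linearly.

```python
# Illustrative sketch only: the saliency scores and the top-k rule are
# assumptions, not the selection method proposed in the paper.

def select_salient(features, scores, k):
    """Keep the k features with the highest saliency scores,
    preserving their original (spatial) order."""
    if len(features) != len(scores):
        raise ValueError("features and scores must have the same length")
    k = min(k, len(features))
    # Indices of the k highest scores, then restored to spatial order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [features[i] for i in sorted(top)]


def attention_pairs(n_features, n_chars):
    """Count the query-key pairs scored by each attention type.

    Self-attention compares every feature with every feature (n^2);
    cross-attention compares every decoded character with every feature.
    """
    return {"self": n_features * n_features, "cross": n_chars * n_features}


# Hypothetical sizes: an 8 x 32 feature map and a 25-character output.
full = attention_pairs(8 * 32, 25)
# Hypothetical reduction: keep 64 of the 256 feature positions.
reduced = attention_pairs(64, 25)

print(full)     # pair counts with the full feature map
print(reduced)  # pair counts after keeping a salient subset
```

Under these assumed sizes, keeping a quarter of the positions reduces the self-attention pair count by a factor of 16 and the cross-attention pair count by a factor of 4, which matches the quadratic and linear dependence stated in the abstract.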