Towards reduced-complexity scene text recognition (RCSTR) through a novel salient feature selection
Published in: | International journal on document analysis and recognition, 2024, Vol. 27 (3), pp. 289-302 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | English |
Abstract: | The integration of an attention mechanism has played a crucial role in many recent scene text recognition (STR) methods. It enables the capture of spatial feature dependencies (known as self-attention) and the identification of relevant features while predicting a character (known as cross-attention). However, the computation and memory requirements of the self-attention and cross-attention layers increase quadratically and linearly with the feature map size, respectively, which creates a computational bottleneck in low-resource environments. But is it necessary to attend to the entire feature map? Text in a natural scene is continuous and oriented in a specific direction, and it does not occupy the entire image. Therefore, a small salient subset of the features in text regions is sufficient for accurately predicting characters. Based on this salient feature selection, we propose a reduced-complexity scene text recognition framework that significantly reduces model complexity and memory requirements in the self-attention and cross-attention layers. We validate the proposed framework by employing a convolutional STR architecture with both connectionist temporal classification and transformer decoders. Through model complexity and performance analyses on public benchmark datasets, we demonstrate that the proposed method can substantially reduce model complexity while still maintaining reasonably robust recognition accuracy. |
ISSN: | 1433-2833 1433-2825 |
DOI: | 10.1007/s10032-024-00474-x |
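
The scaling argument in the abstract can be illustrated with a small sketch. It is not the paper's method: the saliency scores, the top-k selection rule, the function names, and the sizes (an 8 x 32 feature map, 25 character queries, 64 retained features) are all assumptions chosen only to show how the quadratic and linear costs shrink when attention runs over a subset of features.

```python
def attention_costs(num_features, num_queries):
    """Count attention score pairs per layer (per head, ignoring channel width)."""
    self_attn = num_features * num_features   # every feature attends to every feature
    cross_attn = num_queries * num_features   # every character query attends to every feature
    return self_attn, cross_attn


def select_salient(scores, k):
    """Return indices of the k highest-scoring features, in their original order.

    Top-k by score is an illustrative stand-in, not the selection rule of the paper.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Hypothetical sizes: 8 x 32 feature map (256 features), 25 character queries.
NUM_FEATURES, NUM_QUERIES, NUM_KEPT = 256, 25, 64

# Placeholder saliency scores; a real model would predict or learn these.
scores = [((i * 37) % 101) / 100 for i in range(NUM_FEATURES)]

full_self, full_cross = attention_costs(NUM_FEATURES, NUM_QUERIES)
kept = select_salient(scores, NUM_KEPT)
red_self, red_cross = attention_costs(len(kept), NUM_QUERIES)

print("self-attention pairs: ", full_self, "->", red_self)
print("cross-attention pairs:", full_cross, "->", red_cross)
```

Under these assumed sizes, keeping a quarter of the features reduces the self-attention pair count by a factor of 16 and the cross-attention pair count by a factor of 4, which is the quadratic versus linear behaviour the abstract describes.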