Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition
Saved in:
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | In text recognition, self-supervised pre-training emerges as a good
solution to reduce dependence on expensive annotated real data. Previous
studies primarily focus on local visual representation by leveraging masked
image modeling or sequence contrastive learning. However, they neglect to model
the linguistic information in text images, which is crucial for recognizing
text. To simultaneously capture local character features and linguistic
information in visual space, we propose Symmetric Superimposition Modeling
(SSM). The objective of SSM is to reconstruct direction-specific pixel and
feature signals from the symmetrically superimposed input. Specifically, we
superimpose the original image and its inverted view to create the
symmetrically superimposed input. At the pixel level, we reconstruct the
original and inverted images to capture character shapes and texture-level
linguistic context. At the feature level, we reconstruct the features of the
same original and inverted images under different augmentations to model
semantic-level linguistic context and local character discrimination. In our
design, the superimposition disrupts character shapes and linguistic rules.
Consequently, the dual-level reconstruction facilitates understanding character
shapes and linguistic information from the perspective of visual texture and
feature semantics. Experiments on various text recognition benchmarks
demonstrate the effectiveness and generality of SSM, with 4.1% average
performance gains and a new state-of-the-art average word accuracy of 86.6% on
the Union14M benchmarks. The code is available at
https://github.com/FaltingsA/SSM. |
DOI: | 10.48550/arxiv.2405.05841 |
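The abstract describes building the SSM input by superimposing an image with an inverted view of itself. The sketch below illustrates that input-construction step only, under assumptions: the "inverted view" is taken to be a horizontal flip, and the two views are combined by simple averaging. The function name `symmetric_superimpose` is hypothetical; the paper's actual pipeline (augmentations, feature-level targets, weighting of the two views) may differ.

```python
import numpy as np

def symmetric_superimpose(image: np.ndarray):
    """Build a symmetrically superimposed input from an image.

    Assumptions (not confirmed by the abstract): the inverted view is a
    horizontal flip, and the two views are averaged with equal weight.
    Returns the superimposed input plus both pixel-level reconstruction
    targets (original and inverted views).
    """
    inverted = image[:, ::-1]                 # horizontally flipped view
    superimposed = 0.5 * (image + inverted)   # equal-weight superimposition
    return superimposed, image, inverted

# Toy usage: a model would take `sup` as input and reconstruct both
# `orig` and `inv` as direction-specific pixel targets.
img = np.arange(12, dtype=np.float32).reshape(3, 4)
sup, orig, inv = symmetric_superimpose(img)
```

Note that averaging an image with its own flip makes the superimposed input itself symmetric, which is what destroys the original character order and forces the model to disentangle the two directional signals during reconstruction.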