VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Text spotting, the task of extracting textual information from image or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn features suitable for each task. Additionally, to enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) built by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in cross-domain scene text spotting, whereas our VimTS model requires significantly fewer parameters and less data. The code and datasets will be made available at https://VimTextSpotter.github.io.
DOI: 10.48550/arxiv.2404.19652
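The abstract describes converting a single-task text spotter into a multi-task image/video model by adding a Prompt Queries Generation Module and Tasks-aware Adapters with few extra parameters. The sketch below illustrates one plausible reading of that idea: a frozen shared encoder, a small bottleneck adapter per task, and learned task prompts refined by cross-attention over the shared features. All class names, dimensions, and wiring here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal, hypothetical sketch of the two ideas mentioned in the abstract:
# (1) lightweight per-task adapters on top of a frozen shared encoder, and
# (2) task prompt queries that interact with shared features via cross-attention.
# Names, dimensions, and structure are assumptions for illustration only.
import torch
import torch.nn as nn


class TaskAwareAdapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class PromptQueryGenerator(nn.Module):
    """Learned task prompts refined by cross-attending to shared features."""
    def __init__(self, dim: int, num_queries: int, num_tasks: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_tasks, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor, task_id: int) -> torch.Tensor:
        # feats: (batch, tokens, dim) shared features from the encoder
        q = self.prompts[task_id].unsqueeze(0).expand(feats.size(0), -1, -1)
        queries, _ = self.cross_attn(q, feats, feats)
        return queries


class MultiTaskSpotterSketch(nn.Module):
    """Frozen shared encoder + per-task adapters + task prompt queries."""
    def __init__(self, dim: int = 256, num_queries: int = 100, num_tasks: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.encoder.parameters():   # keep the shared model fixed;
            p.requires_grad_(False)           # only adapters/prompts are trained
        self.adapters = nn.ModuleList(
            [TaskAwareAdapter(dim) for _ in range(num_tasks)]
        )
        self.query_gen = PromptQueryGenerator(dim, num_queries, num_tasks)

    def forward(self, tokens: torch.Tensor, task_id: int) -> torch.Tensor:
        shared = self.encoder(tokens)              # shared representation
        adapted = self.adapters[task_id](shared)   # task-specific features
        return self.query_gen(adapted, task_id)    # task prompt queries


if __name__ == "__main__":
    model = MultiTaskSpotterSketch()
    out = model(torch.randn(2, 196, 256), task_id=0)  # e.g. task 0 = detection
    print(out.shape)  # torch.Size([2, 100, 256])
```

Because only the adapters and prompt parameters require gradients, the sketch reflects the abstract's claim of adapting a single-task model to multiple tasks with minimal additional trainable parameters.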