UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition
Saved in:
Main Authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | In the digital era, table structure recognition technology is a critical tool
for processing and analyzing large volumes of tabular data. Previous methods
primarily focus on visual aspects of table structure recovery but often fail to
effectively comprehend the textual semantics within tables, particularly for
descriptive textual cells. In this paper, we introduce UniTabNet, a novel
framework for table structure parsing based on the image-to-text model.
UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text
model to decouple table cells and integrating both physical and logical
decoders to reconstruct the complete table structure. We further enhance our
framework with the Vision Guider, which directs the model's focus towards
pertinent areas, thereby boosting prediction accuracy. Additionally, we
introduce the Language Guider to refine the model's capability to understand
textual semantics in table images. Evaluated on prominent table structure
datasets such as PubTabNet, PubTables1M, WTW, and iFLYTAB, UniTabNet achieves a
new state-of-the-art performance, demonstrating the efficacy of our approach.
The code will also be made publicly available. |
---|---|
DOI: | 10.48550/arxiv.2409.13148 |
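The abstract describes reconstructing a complete table from decoupled per-cell predictions made by physical and logical decoders. The sketch below is not the authors' code; it is a minimal, hypothetical illustration of the final assembly step, assuming the logical decoder yields per-cell grid coordinates (`row`, `col`, `row_span`, `col_span`) alongside recognized text:

```python
def cells_to_html(cells, n_rows, n_cols):
    """Assemble an HTML <table> from decoupled cell predictions.

    Each cell is a dict with hypothetical logical-decoder outputs:
      row, col           : top-left grid position of the cell
      row_span, col_span : extent of the cell in grid units
      text               : recognized cell content
    """
    # Mark grid positions covered by a spanning cell so they emit no <td>.
    covered = [[False] * n_cols for _ in range(n_rows)]
    grid = {}
    for c in cells:
        grid[(c["row"], c["col"])] = c
        for r in range(c["row"], c["row"] + c["row_span"]):
            for k in range(c["col"], c["col"] + c["col_span"]):
                if (r, k) != (c["row"], c["col"]):
                    covered[r][k] = True

    rows_html = []
    for r in range(n_rows):
        tds = []
        for k in range(n_cols):
            if covered[r][k]:
                continue  # position absorbed by a rowspan/colspan above or left
            c = grid.get((r, k))
            if c is None:
                tds.append("<td></td>")  # no prediction: emit an empty cell
                continue
            attrs = ""
            if c["row_span"] > 1:
                attrs += f' rowspan="{c["row_span"]}"'
            if c["col_span"] > 1:
                attrs += f' colspan="{c["col_span"]}"'
            tds.append(f'<td{attrs}>{c["text"]}</td>')
        rows_html.append("<tr>" + "".join(tds) + "</tr>")
    return "<table>" + "".join(rows_html) + "</table>"
```

For example, a header cell spanning two columns over two body cells yields `<td colspan="2">` in the first row; the real system would additionally use the physical decoder's bounding boxes and the Vision/Language Guiders described in the abstract.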