Self-Supervised Pre-Training for Table Structure Recognition Transformer
Format: | Article |
---|---|
Language: | English |
Abstract: | Table structure recognition (TSR) aims to convert tabular images into a
machine-readable format. Although the hybrid convolutional neural network
(CNN)-transformer architecture is widely used in existing approaches, the linear
projection transformer has outperformed the hybrid architecture in numerous
vision tasks due to its simplicity and efficiency. However, existing research
has demonstrated that directly replacing the CNN backbone with linear
projection leads to a marked performance drop. In this work, we resolve the
issue by proposing a self-supervised pre-training (SSP) method for TSR
transformers. We discover that the performance gap between the linear
projection transformer and the hybrid CNN-transformer can be mitigated by SSP
of the visual encoder in the TSR model. We conducted reproducible ablation
studies and open-sourced our code at https://github.com/poloclub/unitable to
enhance transparency, inspire innovations, and facilitate fair comparisons in
our domain, as tables are a promising modality for representation learning. |
---|---|
DOI: | 10.48550/arxiv.2402.15578 |
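
The abstract describes two ingredients: replacing the CNN backbone of the TSR visual encoder with a linear patch projection, and closing the resulting performance gap by self-supervised pre-training of that encoder. Below is a minimal PyTorch sketch of both, assuming a ViT-style linear patch embedding and a generic masked-patch reconstruction objective (SimMIM-style) as a stand-in, since the abstract does not specify the actual SSP objective. The class names (`LinearPatchEmbed`, `VisualEncoder`, `SSPWrapper`), patch size, model dimensions, and loss are illustrative and are not taken from the paper or the UniTable repository.

```python
# Minimal sketch (assumptions labeled): a visual encoder whose stem is a single
# linear projection of image patches instead of a CNN backbone, plus a
# SimMIM-style masked-patch reconstruction objective standing in for the SSP
# stage. Names, patch size, and the loss are illustrative only.
import torch
import torch.nn as nn


class LinearPatchEmbed(nn.Module):
    """Split the image into patches and embed each with one linear projection."""

    def __init__(self, img_size=448, patch_size=16, in_chans=3, dim=256):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is the standard implementation of linear patch projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, N, dim)


class VisualEncoder(nn.Module):
    """Linear-projection transformer encoder for table images (no CNN backbone)."""

    def __init__(self, img_size=448, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch_embed = LinearPatchEmbed(img_size, patch_size, 3, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        return self.blocks(self.patch_embed(x) + self.pos_embed)   # (B, N, dim)


class SSPWrapper(nn.Module):
    """Masked-patch reconstruction around the encoder: mask patch tokens,
    run the transformer, and regress the raw pixels of the masked patches."""

    def __init__(self, encoder, dim=256, patch_size=16):
        super().__init__()
        self.encoder = encoder
        self.patch_size = patch_size
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, patch_size * patch_size * 3)

    def forward(self, images, mask_ratio=0.5):
        x = self.encoder.patch_embed(images)                        # (B, N, dim)
        B, N, _ = x.shape
        mask = torch.rand(B, N, device=images.device) < mask_ratio  # True = masked
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, -1), x)
        x = self.encoder.blocks(x + self.encoder.pos_embed)
        pred = self.decoder(x)                                      # (B, N, P*P*3)
        P = self.patch_size
        target = (images.unfold(2, P, P).unfold(3, P, P)            # patchify raw pixels
                  .permute(0, 2, 3, 1, 4, 5).reshape(B, N, -1))
        return ((pred - target) ** 2)[mask].mean()                  # loss on masked patches


if __name__ == "__main__":
    encoder = VisualEncoder()
    ssp = SSPWrapper(encoder)
    images = torch.randn(2, 3, 448, 448)   # stand-in for rendered table images
    loss = ssp(images)
    loss.backward()
    print(float(loss))
```

After this SSP stage, the pre-trained `VisualEncoder` weights would initialize the visual encoder of the downstream TSR transformer; the supervised fine-tuning on table-structure labels is not shown here.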