LiLTv2: Language-substitutable Layout-Image Transformer for Visual Information Extraction
Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, 2024-12
Main Authors: , , , ,
Format: Article
Language: English
Subjects:
Online Access: Full text
Abstract: Visual information extraction (VIE) has experienced substantial growth and heightened interest due to its pivotal role in intelligent document processing. However, most existing pre-trained models for this task can only process data from a certain language or set of languages, often just English, which is a distinct limitation. To address it, we present the Language-substitutable Layout-Image Transformer (LiLTv2). It can be pre-trained just once on monolingual documents and then collaborate with off-the-shelf textual models in other languages during fine-tuning. First, LiLTv2 uses a new dual-stream model architecture: one stream for substitutable text information and the other for layout and image information. Second, LiLTv2 improves the optimization strategy and the set of tasks adopted in the pre-training stage. Finally, we propose a teacher-student knowledge distillation scheme with segment-level multi-modal features, named SegKD. Extensive experiments on widely used benchmarks demonstrate the effectiveness of our method.
ISSN: 1551-6857, 1551-6865
DOI: 10.1145/3708351