GRAM: Global Reasoning for Multi-Page VQA
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | The increasing use of transformer-based large language models brings forward
the challenge of processing long sequences. In document visual question
answering (DocVQA), leading methods focus on the single-page setting, while
documents can span hundreds of pages. We present GRAM, a method that seamlessly
extends pre-trained single-page models to the multi-page setting, without
requiring computationally-heavy pretraining. To do so, we leverage a
single-page encoder for local page-level understanding, and enhance it with
document-level designated layers and learnable tokens, facilitating the flow of
information across pages for global reasoning. To encourage our model to utilize
the newly introduced document tokens, we propose a tailored bias adaptation
method. For additional computational savings during decoding, we introduce an
optional compression stage using our compression-transformer
(C-Former), reducing the encoded sequence length, thereby allowing a tradeoff
between quality and latency. Extensive experiments showcase GRAM's
state-of-the-art performance on the benchmarks for multi-page DocVQA,
demonstrating the effectiveness of our approach. |
---|---|
DOI: | 10.48550/arxiv.2401.03411 |
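
The summary describes interleaving page-level encoding with document-level layers and learnable tokens so that information can flow across pages. The following is only a minimal illustrative sketch of that general idea, not the authors' implementation: the function names (`self_attention`, `local_global_block`, `encode_document`), the projection-free attention, and the randomly initialised stand-in for the learnable document tokens are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    # Single-head self-attention without learned projections (illustration only).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores) @ x


def local_global_block(pages, doc_tokens):
    # pages: list of (n_i, d) arrays; doc_tokens: list of (k, d) arrays, one per page.
    k = doc_tokens[0].shape[0]
    new_pages, new_doc = [], []
    # Local step: each page attends only to itself plus its own document tokens.
    for page, doc in zip(pages, doc_tokens):
        out = self_attention(np.concatenate([doc, page], axis=0))
        new_doc.append(out[:k])
        new_pages.append(out[k:])
    # Global step: only the document tokens attend across pages.
    mixed = self_attention(np.concatenate(new_doc, axis=0))
    new_doc = [mixed[i * k:(i + 1) * k] for i in range(len(pages))]
    return new_pages, new_doc


def encode_document(pages, num_doc_tokens=2, num_blocks=2, seed=0):
    # Hypothetical driver: random vectors stand in for learnable document tokens.
    rng = np.random.default_rng(seed)
    d = pages[0].shape[1]
    doc_init = rng.normal(size=(num_doc_tokens, d))
    doc_tokens = [doc_init.copy() for _ in pages]
    for _ in range(num_blocks):
        pages, doc_tokens = local_global_block(pages, doc_tokens)
    return pages, doc_tokens
```

In this sketch, page tokens never attend to other pages directly. Cross-page information reaches them only through the document tokens, so at least two blocks are needed before the content of one page can influence the token representations of another.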