FlexAttention for Efficient High-Resolution Vision-Language Models
Format: Article
Language: English
Summary: Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively use all of these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. Specifically, a high-resolution image is encoded both as high-resolution tokens and as low-resolution tokens, and only the low-resolution tokens plus a few selected high-resolution tokens are used to calculate the attention map, which greatly shrinks the computational cost. The high-resolution tokens are selected via a high-resolution selection module that retrieves tokens of relevant regions based on an input attention map. The selected high-resolution tokens are then concatenated with the low-resolution tokens and text tokens and fed into a hierarchical self-attention layer, which produces an attention map that is used for the next step of high-resolution token selection. The hierarchical self-attention and high-resolution token selection are performed iteratively at each attention layer. Experiments on multimodal benchmarks show that FlexAttention outperforms existing high-resolution VLMs (e.g., ~9% relative improvement on V* Bench, ~7% on TextVQA) while reducing the computational cost by nearly 40%.
DOI: 10.48550/arxiv.2407.20228
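
To make the iterative mechanism in the abstract concrete, below is a minimal PyTorch sketch written only from that description. The names (FlexAttentionLayer, select_high_res), the text-to-low-res attention scoring rule, the ratio r between resolutions, the top-k selection, and all tensor shapes are illustrative assumptions, not the authors' released implementation; consult the paper at the DOI above for the actual method.

import torch
import torch.nn as nn


class FlexAttentionLayer(nn.Module):
    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, hi_sel, lo, txt):
        # Hierarchical self-attention over [selected high-res | low-res | text].
        x = torch.cat([hi_sel, lo, txt], dim=1)
        out, w = self.attn(x, x, x, need_weights=True)  # w: (B, L, L), head-averaged
        n_hi, n_lo = hi_sel.size(1), lo.size(1)
        # Score each low-res token by how strongly the text tokens attend to it
        # (an assumed stand-in for the paper's "input attention map").
        txt_to_lo = w[:, n_hi + n_lo:, n_hi:n_hi + n_lo]  # (B, n_txt, n_lo)
        return out, txt_to_lo.mean(dim=1)                 # scores: (B, n_lo)


def select_high_res(hi, lo_scores, r, k):
    # Each low-res token is assumed to cover `r` consecutive high-res tokens
    # (a stand-in for the spatial correspondence between the two encodings);
    # broadcast the low-res scores over those regions, keep the top-k tokens.
    hi_scores = lo_scores.repeat_interleave(r, dim=1)     # (B, n_hi_total)
    idx = hi_scores.topk(k, dim=1).indices                # (B, k)
    return hi.gather(1, idx.unsqueeze(-1).expand(-1, -1, hi.size(-1)))


# Toy usage: 16 low-res tokens, each covering r=4 of 64 high-res tokens.
B, D, r, k = 2, 64, 4, 8
hi = torch.randn(B, 64, D)    # full high-resolution token pool (kept static here)
lo = torch.randn(B, 16, D)    # low-resolution image tokens
txt = torch.randn(B, 10, D)   # text tokens
hi_sel = select_high_res(hi, torch.rand(B, 16), r, k)  # random initial selection

for layer in [FlexAttentionLayer(D) for _ in range(3)]:
    out, lo_scores = layer(hi_sel, lo, txt)
    # Carry the updated low-res/text tokens forward, then re-select
    # high-res tokens from the static pool for the next layer, so selection
    # and hierarchical attention alternate at every layer.
    _, lo, txt = out.split([k, lo.size(1), txt.size(1)], dim=1)
    hi_sel = select_high_res(hi, lo_scores, r, k)

The cost saving in the abstract follows from the sequence length: each layer attends over only k selected high-resolution tokens plus the low-resolution and text tokens, rather than the full high-resolution token set, so the quadratic attention cost shrinks accordingly.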