Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online Access: | Order full text |
Abstract:
Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance on many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. They therefore cannot be directly applied to image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we find that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images; (2) the ICU performance of VLFMs degrades significantly under viewpoint variations, because the relationships between objects are altered when the viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic spaces, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance on all metrics.
DOI: 10.48550/arxiv.2309.08585
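
The abstract describes the fused adapter image encoder only at a high level: small trainable modules are inserted into a frozen pre-trained encoder, with "fused" variants that also see features from the paired image. Below is a minimal PyTorch sketch of that general idea. The class names, dimensions, and the fusion-by-concatenation choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of adapter-style fine-tuning for image pairs.
# Hypothetical modules; names and sizes are illustrative, not the paper's code.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small trainable bottleneck inserted into a frozen encoder block."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen features intact.
        return x + self.up(self.act(self.down(x)))


class FusedAdapter(nn.Module):
    """Adapter that also mixes in features from the paired image,
    so the encoder can attend to cross-image nuances (assumed fusion:
    channel-wise concatenation before the bottleneck)."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(2 * dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, x_pair: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([x, x_pair], dim=-1)
        return x + self.up(self.act(self.down(fused)))


# Usage: freeze the pre-trained encoder, train only the adapters.
dim = 768
before = torch.randn(1, 196, dim)  # patch tokens of the "before" image
after = torch.randn(1, 196, dim)   # patch tokens of the "after" image
updated = FusedAdapter(dim)(before, after)
print(updated.shape)  # torch.Size([1, 196, 768])
```

Because only the adapters are trained, this style of fine-tuning adapts the encoder to pairwise change cues while leaving the pre-trained single-image representation untouched.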