Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding

Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly a...

Bibliographic Details
Main authors: Lu, Xiaonan; Yuan, Jianlong; Niu, Ruigang; Hu, Yuan; Wang, Fan
Format: Article
Language: eng
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Lu, Xiaonan
Yuan, Jianlong
Niu, Ruigang
Hu, Yuan
Wang, Fan
description Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.
doi_str_mv 10.48550/arxiv.2309.08585
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2309_08585</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2309_08585</sourcerecordid><originalsourceid>FETCH-LOGICAL-a675-e96ee79f46b35a9348bc9ebdee961109a40ef36366cdbaf86663ac268e3baac13</originalsourceid><addsrcrecordid>eNotj0FOwzAQRb1hgQoHYIUvkODUsWsvq4hCpCAkVLqNJvEktdraleNSuD1J29Uf_Tf60iPkKWNproRgLxB-7U8650ynTAkl7sluY_F89NZFWrqIfYBovaPgDP3C3g7xVpxt3NKNHaa7AtefoEe68idnrvzDG9zTzgdaHiZUbMcnpN_OYBjiOGdd_0DuOtgP-HjLGVmvXtfFe1J9vpXFskpALkSCWiIudJfLhgvQPFdNq7ExOIIsYxpyhh2XXMrWNNApKSWHdi4V8gagzfiMPF9nL7b1MdgDhL96sq4v1vwftlVU9Q</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding</title><source>arXiv.org</source><creator>Lu, Xiaonan ; Yuan, Jianlong ; Niu, Ruigang ; Hu, Yuan ; Wang, Fan</creator><creatorcontrib>Lu, Xiaonan ; Yuan, Jianlong ; Niu, Ruigang ; Hu, Yuan ; Wang, Fan</creatorcontrib><description>Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. 
(2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.</description><identifier>DOI: 10.48550/arxiv.2309.08585</identifier><language>eng</language><subject>Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2023-09</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,781,886</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2309.08585$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2309.08585$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Lu, Xiaonan</creatorcontrib><creatorcontrib>Yuan, Jianlong</creatorcontrib><creatorcontrib>Niu, Ruigang</creatorcontrib><creatorcontrib>Hu, Yuan</creatorcontrib><creatorcontrib>Wang, Fan</creatorcontrib><title>Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change 
Understanding</title><description>Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. 
Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.</description><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj0FOwzAQRb1hgQoHYIUvkODUsWsvq4hCpCAkVLqNJvEktdraleNSuD1J29Uf_Tf60iPkKWNproRgLxB-7U8650ynTAkl7sluY_F89NZFWrqIfYBovaPgDP3C3g7xVpxt3NKNHaa7AtefoEe68idnrvzDG9zTzgdaHiZUbMcnpN_OYBjiOGdd_0DuOtgP-HjLGVmvXtfFe1J9vpXFskpALkSCWiIudJfLhgvQPFdNq7ExOIIsYxpyhh2XXMrWNNApKSWHdi4V8gagzfiMPF9nL7b1MdgDhL96sq4v1vwftlVU9Q</recordid><startdate>20230915</startdate><enddate>20230915</enddate><creator>Lu, Xiaonan</creator><creator>Yuan, Jianlong</creator><creator>Niu, Ruigang</creator><creator>Hu, Yuan</creator><creator>Wang, Fan</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20230915</creationdate><title>Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding</title><author>Lu, Xiaonan ; Yuan, Jianlong ; Niu, Ruigang ; Hu, Yuan ; Wang, Fan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a675-e96ee79f46b35a9348bc9ebdee961109a40ef36366cdbaf86663ac268e3baac13</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Lu, Xiaonan</creatorcontrib><creatorcontrib>Yuan, Jianlong</creatorcontrib><creatorcontrib>Niu, Ruigang</creatorcontrib><creatorcontrib>Hu, Yuan</creatorcontrib><creatorcontrib>Wang, Fan</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Lu, Xiaonan</au><au>Yuan, Jianlong</au><au>Niu, Ruigang</au><au>Hu, Yuan</au><au>Wang, Fan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding</atitle><date>2023-09-15</date><risdate>2023</risdate><abstract>Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. 
Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.</abstract><doi>10.48550/arxiv.2309.08585</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2309.08585
ispartof
issn
language eng
recordid cdi_arxiv_primary_2309_08585
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-12T01%3A55%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Viewpoint%20Integration%20and%20Registration%20with%20Vision%20Language%20Foundation%20Model%20for%20Image%20Change%20Understanding&rft.au=Lu,%20Xiaonan&rft.date=2023-09-15&rft_id=info:doi/10.48550/arxiv.2309.08585&rft_dat=%3Carxiv_GOX%3E2309_08585%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true