Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding

Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly a...

Bibliographic details
Main authors:	Lu, Xiaonan; Yuan, Jianlong; Niu, Ruigang; Hu, Yuan; Wang, Fan
Format:	Article
Language:	eng
Subjects:	Computer Science - Computer Vision and Pattern Recognition
Online access:	Order full text

creator	Lu, Xiaonan; Yuan, Jianlong; Niu, Ruigang; Hu, Yuan; Wang, Fan
description	Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.
doi_str_mv	10.48550/arxiv.2309.08585
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2309_08585</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2309_08585</sourcerecordid><originalsourceid>FETCH-LOGICAL-a675-e96ee79f46b35a9348bc9ebdee961109a40ef36366cdbaf86663ac268e3baac13</originalsourceid><addsrcrecordid>eNotj0FOwzAQRb1hgQoHYIUvkODUsWsvq4hCpCAkVLqNJvEktdraleNSuD1J29Uf_Tf60iPkKWNproRgLxB-7U8650ynTAkl7sluY_F89NZFWrqIfYBovaPgDP3C3g7xVpxt3NKNHaa7AtefoEe68idnrvzDG9zTzgdaHiZUbMcnpN_OYBjiOGdd_0DuOtgP-HjLGVmvXtfFe1J9vpXFskpALkSCWiIudJfLhgvQPFdNq7ExOIIsYxpyhh2XXMrWNNApKSWHdi4V8gagzfiMPF9nL7b1MdgDhL96sq4v1vwftlVU9Q</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding</title><source>arXiv.org</source><creator>Lu, Xiaonan ; Yuan, Jianlong ; Niu, Ruigang ; Hu, Yuan ; Wang, Fan</creator><creatorcontrib>Lu, Xiaonan ; Yuan, Jianlong ; Niu, Ruigang ; Hu, Yuan ; Wang, Fan</creatorcontrib><description>Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. 
(2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.</description><identifier>DOI: 10.48550/arxiv.2309.08585</identifier><language>eng</language><subject>Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2023-09</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,781,886</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2309.08585$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2309.08585$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Lu, Xiaonan</creatorcontrib><creatorcontrib>Yuan, Jianlong</creatorcontrib><creatorcontrib>Niu, Ruigang</creatorcontrib><creatorcontrib>Hu, Yuan</creatorcontrib><creatorcontrib>Wang, Fan</creatorcontrib><title>Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change 
Understanding</title><description>Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. 
Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.</description><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj0FOwzAQRb1hgQoHYIUvkODUsWsvq4hCpCAkVLqNJvEktdraleNSuD1J29Uf_Tf60iPkKWNproRgLxB-7U8650ynTAkl7sluY_F89NZFWrqIfYBovaPgDP3C3g7xVpxt3NKNHaa7AtefoEe68idnrvzDG9zTzgdaHiZUbMcnpN_OYBjiOGdd_0DuOtgP-HjLGVmvXtfFe1J9vpXFskpALkSCWiIudJfLhgvQPFdNq7ExOIIsYxpyhh2XXMrWNNApKSWHdi4V8gagzfiMPF9nL7b1MdgDhL96sq4v1vwftlVU9Q</recordid><startdate>20230915</startdate><enddate>20230915</enddate><creator>Lu, Xiaonan</creator><creator>Yuan, Jianlong</creator><creator>Niu, Ruigang</creator><creator>Hu, Yuan</creator><creator>Wang, Fan</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20230915</creationdate><title>Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding</title><author>Lu, Xiaonan ; Yuan, Jianlong ; Niu, Ruigang ; Hu, Yuan ; Wang, Fan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a675-e96ee79f46b35a9348bc9ebdee961109a40ef36366cdbaf86663ac268e3baac13</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Lu, Xiaonan</creatorcontrib><creatorcontrib>Yuan, Jianlong</creatorcontrib><creatorcontrib>Niu, Ruigang</creatorcontrib><creatorcontrib>Hu, Yuan</creatorcontrib><creatorcontrib>Wang, Fan</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Lu, Xiaonan</au><au>Yuan, Jianlong</au><au>Niu, Ruigang</au><au>Hu, Yuan</au><au>Wang, Fan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding</atitle><date>2023-09-15</date><risdate>2023</risdate><abstract>Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to cope with image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, which is caused by the altered relationships between objects when viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting designed trainable adapters and fused adapters, to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic space, respectively. 
Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance in all metrics.</abstract><doi>10.48550/arxiv.2309.08585</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2309.08585
language	eng
recordid	cdi_arxiv_primary_2309_08585
source	arXiv.org
subjects	Computer Science - Computer Vision and Pattern Recognition
title	Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-12T01%3A55%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Viewpoint%20Integration%20and%20Registration%20with%20Vision%20Language%20Foundation%20Model%20for%20Image%20Change%20Understanding&rft.au=Lu,%20Xiaonan&rft.date=2023-09-15&rft_id=info:doi/10.48550/arxiv.2309.08585&rft_dat=%3Carxiv_GOX%3E2309_08585%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true