Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison
Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existi...
Main authors: | Yang, Qian; Yan, Weixiang; Agrawal, Aishwarya |
---|---|
Format: | Article |
Language: | eng |
creator | Yang, Qian; Yan, Weixiang; Agrawal, Aishwarya |
---|---|
description | Despite tremendous advances, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect: they tend to hallucinate and may generate biased responses. A way to assess the reliability of a given VLM response is therefore useful. Existing methods, such as estimating uncertainty from answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation bias. To alleviate these problems, we propose Decompose and Compare Consistency (DeCC) for reliability measurement. DeCC measures the reliability of the VLM's direct answer by checking its consistency with indirect answers: the direct answer is generated using the VLM's internal reasoning process, while the indirect answers are obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM. Experiments across six vision-language tasks with three VLMs show that DeCC's reliability estimates correlate better with task accuracy than those of existing methods. |
doi_str_mv | 10.48550/arxiv.2407.07840 |
format | Article |
date | 2024-07-10 |
rights | http://creativecommons.org/licenses/by/4.0 |
oa | free_for_read |
linktorsrc | https://arxiv.org/abs/2407.07840 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2407.07840 |
language | eng |
recordid | cdi_arxiv_primary_2407_07840 |
source | arXiv.org |
subjects | Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition |
title | Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T10%3A12%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Decompose%20and%20Compare%20Consistency:%20Measuring%20VLMs'%20Answer%20Reliability%20via%20Task-Decomposition%20Consistency%20Comparison&rft.au=Yang,%20Qian&rft.date=2024-07-10&rft_id=info:doi/10.48550/arxiv.2407.07840&rft_dat=%3Carxiv_GOX%3E2407_07840%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
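
The consistency check summarized in the description field can be illustrated with a minimal sketch. It is not the authors' implementation: the record does not say how consistency is scored, so the sketch assumes normalized exact match and reports the fraction of indirect answers that agree with the direct answer. The names `decc_style_reliability`, `answer_fn`, `decompose_fn`, and `reasoners` are hypothetical, and the toy stand-ins replace a real VLM.

```python
from typing import Callable, List, Sequence, Tuple

SubQA = Tuple[str, str]  # (sub-question, sub-answer)


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def decc_style_reliability(
    image: object,
    question: str,
    answer_fn: Callable[[object, str], str],
    decompose_fn: Callable[[str], Sequence[str]],
    reasoners: Sequence[Callable[[str, List[SubQA]], str]],
) -> Tuple[str, float]:
    """Return the direct answer and an assumed agreement score in [0, 1]."""
    # Direct answer: ask the model the original question.
    direct = answer_fn(image, question)
    # Indirect path: answer each sub-question, then reason over the sub-answers.
    sub_qas = [(sq, answer_fn(image, sq)) for sq in decompose_fn(question)]
    indirect = [reason(question, sub_qas) for reason in reasoners]
    if not indirect:
        raise ValueError("at least one reasoner is required")
    # Assumption: consistency is normalized exact match, averaged over reasoners.
    agree = sum(normalize(a) == normalize(direct) for a in indirect)
    return direct, agree / len(indirect)


# Toy stand-ins (hypothetical) so the sketch runs without a real model.
CANNED = {
    "What color is the car?": "Red",
    "Is there a car?": "yes",
    "What color is the vehicle?": "red",
}


def toy_answer(image: object, question: str) -> str:
    return CANNED[question]


def toy_decompose(question: str) -> List[str]:
    return ["Is there a car?", "What color is the vehicle?"]


def last_sub_answer(question: str, sub_qas: List[SubQA]) -> str:
    return sub_qas[-1][1]


def always_blue(question: str, sub_qas: List[SubQA]) -> str:
    return "blue"


direct, score = decc_style_reliability(
    None, "What color is the car?", toy_answer, toy_decompose,
    [last_sub_answer, always_blue],
)
print(direct, score)
```

With a real model, `answer_fn` would wrap the VLM call and `decompose_fn` and the reasoners would typically be prompted language models. A low score suggests that the direct answer is not supported by the decomposed reasoning path.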