CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and...

Detailed description

Bibliographic Details
Main authors: Li, Junyan; Chen, Delin; Hong, Yining; Chen, Zhenfang; Chen, Peihao; Shen, Yikang; Gan, Chuang
Format: Article
Language: eng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
creator Li, Junyan; Chen, Delin; Hong, Yining; Chen, Zhenfang; Chen, Peihao; Shen, Yikang; Gan, Chuang
description A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions-of-interests (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. The vision-to-language and language-to-vision communication are iteratively performed until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on ARO top-1 accuracy). We also achieve state-of-the-art performances on traditional vision-language tasks such as referring expression comprehension and visual question answering.
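To make the decoding loop described in the abstract concrete, the following minimal Python sketch illustrates one plausible form of the vision-to-language and language-to-vision communication cycle. All interfaces here are illustrative assumptions, not the authors' actual implementation: the language model lm (with tokenize, generate_next, hidden_state, embed_regions, detokenize), the vision_encoder, the detector with propose_regions, and the "<visual>" communication token are hypothetical names standing in for the components the paper describes.

def communicative_decoding(image, prompt, lm, vision_encoder, detector, max_steps=128):
    # Encode the image once; the detection network reuses these features
    # every time it is asked for region proposals. (Hypothetical interfaces.)
    image_features = vision_encoder(image)
    tokens = lm.tokenize(prompt)
    for _ in range(max_steps):
        # Ordinary language-modeling step: the LLM may emit a word
        # or a special communication token after an entity/relation.
        next_token = lm.generate_next(tokens)
        tokens.append(next_token)
        if next_token == "<visual>":
            # Language-to-vision: ask the detection network for regions
            # relevant to the sentence generated so far, conditioned on
            # the LLM's current state.
            rois = detector.propose_regions(image_features, lm.hidden_state(tokens))
            # Vision-to-language: feed the ROI features back so later
            # words are generated contingent on the relevant regions.
            tokens.extend(lm.embed_regions(rois))
        if next_token == lm.eos_token:
            # Stop once the entire sentence has been generated.
            break
    return lm.detokenize(tokens)

The loop alternates between the two directions of communication until the sentence is complete, which is the iterative behavior the abstract attributes to CoVLM; the concrete tensor shapes, token vocabulary, and detector architecture are left unspecified here.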
format Article
identifier DOI: 10.48550/arxiv.2311.03354
language eng
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
url https://arxiv.org/abs/2311.03354