Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving

Bibliographic details
Main authors: Rekanar, Kaavya; Eising, Ciarán; Sistu, Ganesh; Hayes, Martin
Format: Article
Language: English
Description:
Proceedings of the Irish Machine Vision and Image Processing Conference 2023.
This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of their responses to reference answers provided by computer vision experts. Model selection is based on an analysis of how transformers are used in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion show promise for generating better answers in a driving context. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving. (A minimal illustrative sketch of this similarity scoring appears at the end of this record.)
DOI: 10.48550/arxiv.2307.09329
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition
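The description notes that each model's answers are scored by their similarity to reference answers written by computer vision experts, but this record does not say which similarity measure the authors use. The sketch below is only an illustration of that kind of scoring, using plain string similarity from the Python standard library; the function name, question, and answers are hypothetical and not taken from the paper.

# Illustrative sketch only (not the paper's method): score a VQA model's answer
# against expert reference answers using simple string similarity (difflib).
from difflib import SequenceMatcher

def answer_similarity(model_answer, reference_answers):
    """Return the best similarity ratio (0..1) between the model answer and any reference."""
    def normalize(text):
        # Lowercase and collapse whitespace so trivial formatting differences do not matter.
        return " ".join(text.lower().split())
    return max(
        SequenceMatcher(None, normalize(model_answer), normalize(ref)).ratio()
        for ref in reference_answers
    )

# Hypothetical driving-scene example; question and answers are invented for illustration.
references = ["the traffic light ahead is red", "a red light at the intersection"]
print(answer_similarity("The light ahead is red", references))  # prints a score between 0 and 1

In practice an embedding-based measure (for example, cosine similarity between sentence embeddings) would likely be preferred over raw string matching, but the metric actually used in the paper is not specified in this record.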