Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving

Bibliographic details
Main authors: Rekanar, Kaavya; Eising, Ciarán; Sistu, Ganesh; Hayes, Martin
Format: Article
Language: English
Description:
Proceedings of the Irish Machine Vision and Image Processing Conference 2023.
This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of their responses to reference answers provided by computer vision experts. Model selection is based on an analysis of how transformers are used in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion show promise for generating better answers in a driving context. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving. (A minimal illustrative sketch of this similarity scoring appears at the end of this record.)
DOI: 10.48550/arxiv.2307.09329
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition
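The description notes that each model's answers are scored by their similarity to reference answers written by computer vision experts, but this record does not say which similarity measure the authors use. The sketch below is only an illustration of that kind of scoring, using plain string similarity from the Python standard library; the function name, question, and answers are hypothetical and not taken from the paper.

# Illustrative sketch only (not the paper's method): score a VQA model's answer
# against expert reference answers using simple string similarity (difflib).
from difflib import SequenceMatcher

def answer_similarity(model_answer, reference_answers):
    """Return the best similarity ratio (0..1) between the model answer and any reference."""
    def normalize(text):
        # Lowercase and collapse whitespace so trivial formatting differences do not matter.
        return " ".join(text.lower().split())
    return max(
        SequenceMatcher(None, normalize(model_answer), normalize(ref)).ratio()
        for ref in reference_answers
    )

# Hypothetical driving-scene example; question and answers are invented for illustration.
references = ["the traffic light ahead is red", "a red light at the intersection"]
print(answer_similarity("The light ahead is red", references))  # prints a score between 0 and 1

In practice an embedding-based measure (for example, cosine similarity between sentence embeddings) would likely be preferred over raw string matching, but the metric actually used in the paper is not specified in this record.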