Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving
creator | Rekanar, Kaavya; Eising, Ciarán; Sistu, Ganesh; Hayes, Martin |
description | Proceedings of the Irish Machine Vision and Image Processing Conference 2023. This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts. Model selection is predicated on the analysis of transformer utilization in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques exhibit promising potential for generating improved answers within a driving perspective. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving. |
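For illustration only, the sketch below shows how the kind of evaluation described in the abstract could look in practice: querying a pre-trained VQA model (here a publicly available ViLT checkpoint, used as a stand-in for the models studied) with a driving-related question and comparing its answer against an expert reference answer. The checkpoint, image path, question, reference answer, and the string-based similarity measure are all assumptions for demonstration and are not taken from the paper.

```python
# Minimal sketch, not the paper's actual pipeline: query a pre-trained VQA model
# and score its answer against a hypothetical expert reference answer.
from difflib import SequenceMatcher

from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Publicly available ViLT checkpoint fine-tuned for VQA (assumed stand-in).
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("driving_scene.jpg")        # hypothetical driving-scene frame
question = "Is it safe to change into the left lane?"
reference_answer = "no"                        # hypothetical expert reference

# Encode the image-question pair and take the highest-scoring answer class.
encoding = processor(image, question, return_tensors="pt")
logits = model(**encoding).logits
predicted_answer = model.config.id2label[logits.argmax(-1).item()]

# Crude string similarity between the model answer and the reference answer;
# the similarity measure used in the paper may differ.
similarity = SequenceMatcher(None, predicted_answer.lower(), reference_answer).ratio()
print(predicted_answer, round(similarity, 2))
```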
doi | 10.48550/arxiv.2307.09329 |
format | Article |
identifier | DOI: 10.48550/arxiv.2307.09329 |
language | eng |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving |