Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

Large Language Models (LLMs) have shown promise in the autonomous driving sector, particularly in generalization and interpretability. We introduce a unique object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. We also present a new dataset of 160k QA pairs derived from 10k driving scenarios, paired with high-quality control commands collected with an RL agent and question-answer pairs generated by a teacher LLM (GPT-3.5). A distinct pretraining strategy is devised to align numeric vector modalities with static LLM representations using vector captioning language data. We also introduce an evaluation metric for Driving QA and demonstrate our LLM-driver's proficiency in interpreting driving scenarios, answering questions, and decision-making. Our findings highlight the potential of LLM-based driving action generation in comparison to traditional behavioral cloning. We make our benchmark, datasets, and model available for further exploration.
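
As a rough illustration of the fusion idea described in the abstract, the sketch below projects per-object numeric vectors into a frozen pre-trained LLM's token-embedding space so they can be attended to alongside the question tokens. This is not the authors' code: the class name, the vector dimensions, and the HuggingFace-style model interface are all assumptions made for illustration.

import torch
import torch.nn as nn

class ObjectVectorFusionLLM(nn.Module):
    """Hypothetical sketch: fuse object-level numeric vectors with a frozen LLM.

    Each object's feature vector is projected to one pseudo-token embedding and
    prepended to the embedded question tokens before the LLM runs.
    """

    def __init__(self, llm, object_dim=64, hidden_dim=512):
        super().__init__()
        self.llm = llm  # assumed HuggingFace-style causal LM, kept frozen
        for p in self.llm.parameters():
            p.requires_grad = False
        embed_dim = self.llm.get_input_embeddings().embedding_dim
        # Trainable adapter: object vector (position, velocity, class, ...)
        # -> one embedding in the LLM's input space.
        self.vector_adapter = nn.Sequential(
            nn.Linear(object_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, object_vectors, input_ids):
        # object_vectors: (batch, num_objects, object_dim)
        # input_ids:      (batch, seq_len) tokenized driving question
        obj_embeds = self.vector_adapter(object_vectors)
        txt_embeds = self.llm.get_input_embeddings()(input_ids)
        # Concatenate so the frozen LLM attends over object pseudo-tokens
        # and text tokens jointly.
        fused = torch.cat([obj_embeds, txt_embeds], dim=1)
        return self.llm(inputs_embeds=fused)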

Detailed Description

Bibliographic Details
Main Authors: Chen, Long; Sinavski, Oleg; Hünermann, Jan; Karnsund, Alice; Willmott, Andrew James; Birch, Danny; Maund, Daniel; Shotton, Jamie
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Robotics
Online Access: Order full text
doi_str_mv 10.48550/arxiv.2310.01957
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2310.01957
language eng
recordid cdi_arxiv_primary_2310_01957
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Robotics
title Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T02%3A36%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Driving%20with%20LLMs:%20Fusing%20Object-Level%20Vector%20Modality%20for%20Explainable%20Autonomous%20Driving&rft.au=Chen,%20Long&rft.date=2023-10-03&rft_id=info:doi/10.48550/arxiv.2310.01957&rft_dat=%3Carxiv_GOX%3E2310_01957%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true