A System and Benchmark for LLM-based Q&A on Heterogeneous Data

In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further...

Bibliographic details
Published in: arXiv.org 2024-09
Main authors: Fokoue, Achille; Jayaraman, Srideepika; Khabiri, Elham; Kephart, Jeffrey O; Li, Yingjie; Shah, Dhruv; Drissi, Youssef; Heath, Fenno F; Bhamidipaty, Anu; Tipu, Fateh A; Baseman, Robert J
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Fokoue, Achille
Jayaraman, Srideepika
Khabiri, Elham
Kephart, Jeffrey O
Li, Yingjie
Shah, Dhruv
Drissi, Youssef
Heath, Fenno F
Bhamidipaty, Anu
Tipu, Fateh A
Baseman, Robert J
description In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables with data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3102579645</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3102579645</sourcerecordid><originalsourceid>FETCH-proquest_journals_31025796453</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mSwc1QIriwuSc1VSMxLUXBKzUvOyE0sylZIyy9S8PHx1U1KLE5NUQhUc1TIz1PwSC1JLcpPT81LzS8tVnBJLEnkYWBNS8wpTuWF0twMym6uIc4eugVF-YWlqcUl8Vn5pUV5QKl4Y0MDI1NzSzMTU2PiVAEA0X42fw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3102579645</pqid></control><display><type>article</type><title>A System and Benchmark for LLM-based Q&amp;A on Heterogeneous Data</title><source>Free E- Journals</source><creator>Fokoue, Achille ; Jayaraman, Srideepika ; Khabiri, Elham ; Kephart, Jeffrey O ; Li, Yingjie ; Shah, Dhruv ; Drissi, Youssef ; Heath, Fenno F ; Bhamidipaty, Anu ; Tipu, Fateh A ; Baseman, Robert J</creator><creatorcontrib>Fokoue, Achille ; Jayaraman, Srideepika ; Khabiri, Elham ; Kephart, Jeffrey O ; Li, Yingjie ; Shah, Dhruv ; Drissi, Youssef ; Heath, Fenno F ; Bhamidipaty, Anu ; Tipu, Fateh A ; Baseman, Robert J</creatorcontrib><description>In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as a spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. 
In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables by data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Benchmarks ; Data retrieval ; Data sources ; Heterogeneity ; Large language models ; Natural language ; Questions ; Structured data</subject><ispartof>arXiv.org, 2024-09</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Fokoue, Achille</creatorcontrib><creatorcontrib>Jayaraman, Srideepika</creatorcontrib><creatorcontrib>Khabiri, Elham</creatorcontrib><creatorcontrib>Kephart, Jeffrey O</creatorcontrib><creatorcontrib>Li, Yingjie</creatorcontrib><creatorcontrib>Shah, Dhruv</creatorcontrib><creatorcontrib>Drissi, Youssef</creatorcontrib><creatorcontrib>Heath, Fenno F</creatorcontrib><creatorcontrib>Bhamidipaty, Anu</creatorcontrib><creatorcontrib>Tipu, Fateh A</creatorcontrib><creatorcontrib>Baseman, Robert J</creatorcontrib><title>A System and Benchmark for LLM-based Q&amp;A on Heterogeneous Data</title><title>arXiv.org</title><description>In many industrial 
settings, users wish to ask questions whose answers may be found in structured data sources such as a spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables by data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. 
Our modified Spider benchmark will soon be available to the research community</description><subject>Benchmarks</subject><subject>Data retrieval</subject><subject>Data sources</subject><subject>Heterogeneity</subject><subject>Large language models</subject><subject>Natural language</subject><subject>Questions</subject><subject>Structured data</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mSwc1QIriwuSc1VSMxLUXBKzUvOyE0sylZIyy9S8PHx1U1KLE5NUQhUc1TIz1PwSC1JLcpPT81LzS8tVnBJLEnkYWBNS8wpTuWF0twMym6uIc4eugVF-YWlqcUl8Vn5pUV5QKl4Y0MDI1NzSzMTU2PiVAEA0X42fw</recordid><startdate>20240910</startdate><enddate>20240910</enddate><creator>Fokoue, Achille</creator><creator>Jayaraman, Srideepika</creator><creator>Khabiri, Elham</creator><creator>Kephart, Jeffrey O</creator><creator>Li, Yingjie</creator><creator>Shah, Dhruv</creator><creator>Drissi, Youssef</creator><creator>Heath, Fenno F</creator><creator>Bhamidipaty, Anu</creator><creator>Tipu, Fateh A</creator><creator>Baseman, Robert J</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240910</creationdate><title>A System and Benchmark for LLM-based Q&amp;A on Heterogeneous Data</title><author>Fokoue, Achille ; Jayaraman, Srideepika ; Khabiri, Elham ; Kephart, Jeffrey O ; Li, Yingjie ; Shah, Dhruv ; Drissi, Youssef ; Heath, Fenno F ; Bhamidipaty, Anu ; Tipu, Fateh A ; Baseman, Robert 
J</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31025796453</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Benchmarks</topic><topic>Data retrieval</topic><topic>Data sources</topic><topic>Heterogeneity</topic><topic>Large language models</topic><topic>Natural language</topic><topic>Questions</topic><topic>Structured data</topic><toplevel>online_resources</toplevel><creatorcontrib>Fokoue, Achille</creatorcontrib><creatorcontrib>Jayaraman, Srideepika</creatorcontrib><creatorcontrib>Khabiri, Elham</creatorcontrib><creatorcontrib>Kephart, Jeffrey O</creatorcontrib><creatorcontrib>Li, Yingjie</creatorcontrib><creatorcontrib>Shah, Dhruv</creatorcontrib><creatorcontrib>Drissi, Youssef</creatorcontrib><creatorcontrib>Heath, Fenno F</creatorcontrib><creatorcontrib>Bhamidipaty, Anu</creatorcontrib><creatorcontrib>Tipu, Fateh A</creatorcontrib><creatorcontrib>Baseman, Robert J</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central 
China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Fokoue, Achille</au><au>Jayaraman, Srideepika</au><au>Khabiri, Elham</au><au>Kephart, Jeffrey O</au><au>Li, Yingjie</au><au>Shah, Dhruv</au><au>Drissi, Youssef</au><au>Heath, Fenno F</au><au>Bhamidipaty, Anu</au><au>Tipu, Fateh A</au><au>Baseman, Robert J</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>A System and Benchmark for LLM-based Q&amp;A on Heterogeneous Data</atitle><jtitle>arXiv.org</jtitle><date>2024-09-10</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as a spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables by data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. 
Our modified Spider benchmark will soon be available to the research community</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-09
issn 2331-8422
language eng
recordid cdi_proquest_journals_3102579645
source Free E- Journals
subjects Benchmarks
Data retrieval
Data sources
Heterogeneity
Large language models
Natural language
Questions
Structured data
title A System and Benchmark for LLM-based Q&A on Heterogeneous Data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T13%3A05%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20System%20and%20Benchmark%20for%20LLM-based%20Q&A%20on%20Heterogeneous%20Data&rft.jtitle=arXiv.org&rft.au=Fokoue,%20Achille&rft.date=2024-09-10&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3102579645%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3102579645&rft_id=info:pmid/&rfr_iscdi=true