A System and Benchmark for LLM-based Q&A on Heterogeneous Data

In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further...

Detailed description

Bibliographic details
Published in:	arXiv.org 2024-09
Main authors:	Fokoue, Achille; Jayaraman, Srideepika; Khabiri, Elham; Kephart, Jeffrey O; Li, Yingjie; Shah, Dhruv; Drissi, Youssef; Heath, Fenno F; Bhamidipaty, Anu; Tipu, Fateh A; Baseman, Robert J
Format:	Article
Language:	eng
Subjects:	Benchmarks; Data retrieval; Data sources; Heterogeneity; Large language models; Natural language; Questions; Structured data
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Fokoue, Achille; Jayaraman, Srideepika; Khabiri, Elham; Kephart, Jeffrey O; Li, Yingjie; Shah, Dhruv; Drissi, Youssef; Heath, Fenno F; Bhamidipaty, Anu; Tipu, Fateh A; Baseman, Robert J
description	In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables by data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3102579645</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3102579645</sourcerecordid><originalsourceid>FETCH-proquest_journals_31025796453</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mSwc1QIriwuSc1VSMxLUXBKzUvOyE0sylZIyy9S8PHx1U1KLE5NUQhUc1TIz1PwSC1JLcpPT81LzS8tVnBJLEnkYWBNS8wpTuWF0twMym6uIc4eugVF-YWlqcUl8Vn5pUV5QKl4Y0MDI1NzSzMTU2PiVAEA0X42fw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3102579645</pqid></control><display><type>article</type><title>A System and Benchmark for LLM-based Q&A on Heterogeneous Data</title><source>Free E- Journals</source><creator>Fokoue, Achille ; Jayaraman, Srideepika ; Khabiri, Elham ; Kephart, Jeffrey O ; Li, Yingjie ; Shah, Dhruv ; Drissi, Youssef ; Heath, Fenno F ; Bhamidipaty, Anu ; Tipu, Fateh A ; Baseman, Robert J</creator><creatorcontrib>Fokoue, Achille ; Jayaraman, Srideepika ; Khabiri, Elham ; Kephart, Jeffrey O ; Li, Yingjie ; Shah, Dhruv ; Drissi, Youssef ; Heath, Fenno F ; Bhamidipaty, Anu ; Tipu, Fateh A ; Baseman, Robert J</creatorcontrib><description>In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as a spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. 
In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables by data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. Our modified Spider benchmark will soon be available to the research community</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Benchmarks ; Data retrieval ; Data sources ; Heterogeneity ; Large language models ; Natural language ; Questions ; Structured data</subject><ispartof>arXiv.org, 2024-09</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Fokoue, Achille</creatorcontrib><creatorcontrib>Jayaraman, Srideepika</creatorcontrib><creatorcontrib>Khabiri, Elham</creatorcontrib><creatorcontrib>Kephart, Jeffrey O</creatorcontrib><creatorcontrib>Li, Yingjie</creatorcontrib><creatorcontrib>Shah, Dhruv</creatorcontrib><creatorcontrib>Drissi, Youssef</creatorcontrib><creatorcontrib>Heath, Fenno F</creatorcontrib><creatorcontrib>Bhamidipaty, Anu</creatorcontrib><creatorcontrib>Tipu, Fateh A</creatorcontrib><creatorcontrib>Baseman, Robert J</creatorcontrib><title>A System and Benchmark for LLM-based Q&A on Heterogeneous Data</title><title>arXiv.org</title><description>In many industrial settings, 
users wish to ask questions whose answers may be found in structured data sources such as a spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables by data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. 
Our modified Spider benchmark will soon be available to the research community</description><subject>Benchmarks</subject><subject>Data retrieval</subject><subject>Data sources</subject><subject>Heterogeneity</subject><subject>Large language models</subject><subject>Natural language</subject><subject>Questions</subject><subject>Structured data</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mSwc1QIriwuSc1VSMxLUXBKzUvOyE0sylZIyy9S8PHx1U1KLE5NUQhUc1TIz1PwSC1JLcpPT81LzS8tVnBJLEnkYWBNS8wpTuWF0twMym6uIc4eugVF-YWlqcUl8Vn5pUV5QKl4Y0MDI1NzSzMTU2PiVAEA0X42fw</recordid><startdate>20240910</startdate><enddate>20240910</enddate><creator>Fokoue, Achille</creator><creator>Jayaraman, Srideepika</creator><creator>Khabiri, Elham</creator><creator>Kephart, Jeffrey O</creator><creator>Li, Yingjie</creator><creator>Shah, Dhruv</creator><creator>Drissi, Youssef</creator><creator>Heath, Fenno F</creator><creator>Bhamidipaty, Anu</creator><creator>Tipu, Fateh A</creator><creator>Baseman, Robert J</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240910</creationdate><title>A System and Benchmark for LLM-based Q&A on Heterogeneous Data</title><author>Fokoue, Achille ; Jayaraman, Srideepika ; Khabiri, Elham ; Kephart, Jeffrey O ; Li, Yingjie ; Shah, Dhruv ; Drissi, Youssef ; Heath, Fenno F ; Bhamidipaty, Anu ; Tipu, Fateh A ; Baseman, Robert 
J</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31025796453</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Benchmarks</topic><topic>Data retrieval</topic><topic>Data sources</topic><topic>Heterogeneity</topic><topic>Large language models</topic><topic>Natural language</topic><topic>Questions</topic><topic>Structured data</topic><toplevel>online_resources</toplevel><creatorcontrib>Fokoue, Achille</creatorcontrib><creatorcontrib>Jayaraman, Srideepika</creatorcontrib><creatorcontrib>Khabiri, Elham</creatorcontrib><creatorcontrib>Kephart, Jeffrey O</creatorcontrib><creatorcontrib>Li, Yingjie</creatorcontrib><creatorcontrib>Shah, Dhruv</creatorcontrib><creatorcontrib>Drissi, Youssef</creatorcontrib><creatorcontrib>Heath, Fenno F</creatorcontrib><creatorcontrib>Bhamidipaty, Anu</creatorcontrib><creatorcontrib>Tipu, Fateh A</creatorcontrib><creatorcontrib>Baseman, Robert J</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central 
China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Fokoue, Achille</au><au>Jayaraman, Srideepika</au><au>Khabiri, Elham</au><au>Kephart, Jeffrey O</au><au>Li, Yingjie</au><au>Shah, Dhruv</au><au>Drissi, Youssef</au><au>Heath, Fenno F</au><au>Bhamidipaty, Anu</au><au>Tipu, Fateh A</au><au>Baseman, Robert J</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>A System and Benchmark for LLM-based Q&A on Heterogeneous Data</atitle><jtitle>arXiv.org</jtitle><date>2024-09-10</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>In many industrial settings, users wish to ask questions whose answers may be found in structured data sources such as a spreadsheets, databases, APIs, or combinations thereof. Often, the user doesn't know how to identify or access the right data source. This problem is compounded even further if multiple (and potentially siloed) data sources must be assembled to derive the answer. Recently, various Text-to-SQL applications that leverage Large Language Models (LLMs) have addressed some of these problems by enabling users to ask questions in natural language. However, these applications remain impractical in realistic industrial settings because they fail to cope with the data source heterogeneity that typifies such environments. In this paper, we address heterogeneity by introducing the siwarex platform, which enables seamless natural language access to both databases and APIs. To demonstrate the effectiveness of siwarex, we extend the popular Spider dataset and benchmark by replacing some of its tables by data retrieval APIs. We find that siwarex does a good job of coping with data source heterogeneity. 
Our modified Spider benchmark will soon be available to the research community</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-09
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_3102579645
source	Free E- Journals
subjects	Benchmarks; Data retrieval; Data sources; Heterogeneity; Large language models; Natural language; Questions; Structured data
title	A System and Benchmark for LLM-based Q&A on Heterogeneous Data
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T13%3A05%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20System%20and%20Benchmark%20for%20LLM-based%20Q&A%20on%20Heterogeneous%20Data&rft.jtitle=arXiv.org&rft.au=Fokoue,%20Achille&rft.date=2024-09-10&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3102579645%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3102579645&rft_id=info:pmid/&rfr_iscdi=true