Biomedical knowledge graph-optimized prompt generation for large language models

Large Language Models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine. Solutions such as pre-training and domain-specific fine-tuning add substantial computational overhead, requiring further domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework that leverages a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo, and GPT-4 to generate meaningful biomedical text rooted in established knowledge. Compared to the existing RAG technique for knowledge graphs, the proposed method uses a minimal graph schema for context extraction and embedding methods for context pruning. This optimization in context extraction yields a more than 50% reduction in token consumption without compromising accuracy, making for a cost-effective and robust RAG implementation on proprietary LLMs. KG-RAG consistently enhanced the performance of LLMs across diverse biomedical prompts by generating responses rooted in established knowledge, accompanied by accurate provenance and statistical evidence (where available) to substantiate the claims. Further benchmarking on human-curated datasets, such as biomedical true/false and multiple-choice questions (MCQ), showed a remarkable 71% boost in the performance of the Llama-2 model on the challenging MCQ dataset, demonstrating the framework's capacity to empower open-source models with fewer parameters on domain-specific questions. Furthermore, KG-RAG enhanced the performance of proprietary GPT models such as GPT-3.5 and GPT-4. In summary, the proposed framework combines the explicit and implicit knowledge of a KG and an LLM in a token-optimized fashion, enhancing the adaptability of general-purpose LLMs to tackle domain-specific questions cost-effectively.
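The embedding-based context pruning described in the abstract can be illustrated in a few lines. The sketch below is not the authors' implementation: it assumes the KG triples have already been extracted from SPOKE and rendered as short sentences, and the embedding model, the percentile cutoff, and the function name prune_context are placeholders chosen for this example.

import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder embedding model; the paper's actual model choice may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

def prune_context(question, triples, percentile=75.0):
    """Keep only the KG triples semantically closest to the question.

    Embeds the question and each candidate triple, scores every triple
    by cosine similarity to the question, and retains those at or above
    the given similarity percentile, shrinking the prompt before it
    reaches the LLM.
    """
    q_vec = model.encode([question])[0]
    t_vecs = model.encode(triples)
    # Cosine similarity between the question vector and each triple vector.
    sims = t_vecs @ q_vec / (np.linalg.norm(t_vecs, axis=1) * np.linalg.norm(q_vec))
    cutoff = np.percentile(sims, percentile)
    return [t for t, s in zip(triples, sims) if s >= cutoff]

# Hypothetical SPOKE-style triples rendered as sentences.
triples = [
    "multiple sclerosis ASSOCIATES gene HLA-DRB1",
    "multiple sclerosis PRESENTS symptom fatigue",
    "caffeine INTERACTS protein ADORA2A",
]
print(prune_context("Which gene is associated with multiple sclerosis?", triples))

Only the surviving triples are placed in the prompt, which is how this kind of pruning can cut token consumption substantially while keeping responses grounded in the graph.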

Bibliographic Details
Published in: arXiv.org, 2024-05
Main authors: Soman, Karthik; Rose, Peter W; Morris, John H; Akbas, Rabia E; Smith, Brett; Peetoom, Braian; Villouta-Reyes, Catalina; Cerono, Gabriel; Shi, Yongmei; Rizk-Jackson, Angela; Israni, Sharat; Nelson, Charlotte A; Huang, Sui; Baranzini, Sergio E
Format: Article
Language: English
Subjects: Context; Datasets; Knowledge; Knowledge representation; Large language models; Questions; Robustness (mathematics)
Online access: Full text
Identifier: EISSN 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Rights: 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (CC BY 4.0).
Source: Freely Accessible Journals