Evaluating ChatGPT as a Question Answering System: A Comprehensive Analysis and Comparison with Existing Models
In the current era, a multitude of language models has emerged to cater to user inquiries. Notably, the GPT-3.5 Turbo language model has gained substantial attention as the underlying technology for ChatGPT. Leveraging extensive parameters, this model adeptly responds to a wide range of questions. However, due to its reliance on internal knowledge, the accuracy of responses may not be absolute. This article scrutinizes ChatGPT as a Question Answering System (QAS), comparing its performance to other existing QASs. The primary focus is on evaluating ChatGPT's proficiency in extracting responses from provided paragraphs, a core QAS capability. Additionally, performance comparisons are made in scenarios without a surrounding passage. Multiple experiments, exploring response hallucination and considering question complexity, were conducted on ChatGPT. Evaluation employed well-known Question Answering (QA) datasets, including SQuAD, NewsQA, and PersianQuAD, across English and Persian languages. Metrics such as F-score, exact match, and accuracy were employed in the assessment. The study reveals that, while ChatGPT demonstrates competence as a generative model, it is less effective in question answering compared to task-specific models. Providing context improves its performance, and prompt engineering enhances precision, particularly for questions lacking explicit answers in provided paragraphs. ChatGPT excels at simpler factual questions compared to "how" and "why" question types. The evaluation highlights occurrences of hallucinations, where ChatGPT provides responses to questions without available answers in the provided context.
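The abstract reports results in terms of exact match (EM) and token-level F-score. For readers unfamiliar with these QA metrics, the sketch below shows how SQuAD-style EM and F1 are conventionally computed; it is an illustrative reimplementation of the standard normalization-and-overlap convention, not code taken from the paper.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and
    English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1 if the normalized answers are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall over
    the multiset of shared tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A partially correct answer earns partial F1 credit but no EM credit.
print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # 1 (articles stripped)
print(f1_score("Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.67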
Saved in:
Published in: | arXiv.org 2023-12 |
---|---|
Main authors: | Bahak, Hossein; Taheri, Farzaneh; Zojaji, Zahra; Kazemi, Arefeh |
Format: | Article |
Language: | English (eng) |
Subjects: | Accuracy; Chatbots; Context; Performance assessment; Questions |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Bahak, Hossein; Taheri, Farzaneh; Zojaji, Zahra; Kazemi, Arefeh |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2901609549 |
source | Free E-Journals |
subjects | Accuracy; Chatbots; Context; Performance assessment; Questions |
title | Evaluating ChatGPT as a Question Answering System: A Comprehensive Analysis and Comparison with Existing Models |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T11%3A06%3A16IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Evaluating%20ChatGPT%20as%20a%20Question%20Answering%20System:%20A%20Comprehensive%20Analysis%20and%20Comparison%20with%20Existing%20Models&rft.jtitle=arXiv.org&rft.au=Bahak,%20Hossein&rft.date=2023-12-11&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2901609549%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2901609549&rft_id=info:pmid/&rfr_iscdi=true |