PersianQuAD: The Native Question Answering Dataset for the Persian Language

Developing Question Answering systems (QA) is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Ma...

Bibliographic details
Published in:	IEEE access 2022, Vol.10, p.26045-26057
Main authors:	Kazemi, Arefeh; Mozafari, Jamshid; Nematbakhsh, Mohammad Ali
Format:	Article
Language:	eng
Subjects:	Artificial intelligence; Buildings; Dataset; Datasets; Deep learning; Encyclopedias; Human performance; Internet; machine reading comprehension; Machine translation; natural language processing; Online services; Persian; Persian language; Quality assurance; question answering; Questions; Task analysis; Training
Online access:	Full text

container_end_page	26057
container_issue
container_start_page	26045
container_title	IEEE access
container_volume	10
creator	Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali
description	Developing Question Answering systems (QA) is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. PersianQuAD and all QA models implemented in this paper are freely available.
doi_str_mv	10.1109/ACCESS.2022.3157289
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1109_ACCESS_2022_3157289</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9729745</ieee_id><doaj_id>oai_doaj_org_article_9959a8ad9aa446c1b95ae7cee50edb23</doaj_id><sourcerecordid>2639932975</sourcerecordid><originalsourceid>FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</originalsourceid><addsrcrecordid>eNpNUU1PwkAQbYwmEuQXcGniubjf3fVGAJVIVAKeN8N2Wkuwxd2i8d9bLDHOZSYv772ZyYuiISUjSom5GU8ms9VqxAhjI05lyrQ5i3qMKpNwydX5v_kyGoSwJW3pFpJpL3p8QR9KqJaH8fQ2Xr9h_ARN-Ynx8oChKesqHlfhC31ZFfEUGgjYxHnt46ZlnqTxAqriAAVeRRc57AIOTr0fvd7N1pOHZPF8P5-MF4kTRDdJrqRxRNGMmCwTnAjFhKYbSlKJGlWqWe5yIrSkzMFGU2NEzpymDoVkQBXvR_PON6tha_e-fAf_bWso7S9Q-8KCb0q3Q2uMNKAhMwBCKEc3RgKmDlESzDaMt17Xndfe1x_Hl-22PviqPd8yxY3hzKSyZfGO5Xwdgsf8bysl9hiC7UKwxxDsKYRWNexUJSL-KUzaWgrJfwDWgoEL</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2639932975</pqid></control><display><type>article</type><title>PersianQuAD: The Native Question Answering Dataset for the Persian Language</title><source>Directory of Open Access Journals - May need to register for free articles</source><source>IEEE Xplore Open Access Journals</source><source>EZB Electronic Journals Library</source><creator>Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali</creator><creatorcontrib>Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali</creatorcontrib><description>Developing Question Answering systems (QA) is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. 
In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. 
PersianQuAD and all QA models implemented in this paper are freely available.</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2022.3157289</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Artificial intelligence ; Buildings ; Dataset ; Datasets ; Deep learning ; Encyclopedias ; Human performance ; Internet ; machine reading comprehension ; Machine translation ; natural language processing ; Online services ; Persian ; Persian language ; Quality assurance ; question answering ; Questions ; Task analysis ; Training</subject><ispartof>IEEE access, 2022, Vol.10, p.26045-26057</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</citedby><cites>FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</cites><orcidid>0000-0002-8643-9713 ; 0000-0003-4850-9239 ; 0000-0002-4374-9228</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9729745$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,776,780,860,2095,4009,27612,27902,27903,27904,54912</link.rule.ids></links><search><creatorcontrib>Kazemi, Arefeh</creatorcontrib><creatorcontrib>Mozafari, Jamshid</creatorcontrib><creatorcontrib>Nematbakhsh, Mohammad Ali</creatorcontrib><title>PersianQuAD: The Native Question Answering Dataset for the Persian Language</title><title>IEEE access</title><addtitle>Access</addtitle><description>Developing Question Answering systems (QA) is one of the main goals in Artificial 
Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. 
PersianQuAD and all QA models implemented in this paper are freely available.</description><subject>Artificial intelligence</subject><subject>Buildings</subject><subject>Dataset</subject><subject>Datasets</subject><subject>Deep learning</subject><subject>Encyclopedias</subject><subject>Human performance</subject><subject>Internet</subject><subject>machine reading comprehension</subject><subject>Machine translation</subject><subject>natural language processing</subject><subject>Online services</subject><subject>Persian</subject><subject>Persian language</subject><subject>Quality assurance</subject><subject>question answering</subject><subject>Questions</subject><subject>Task analysis</subject><subject>Training</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><sourceid>DOA</sourceid><recordid>eNpNUU1PwkAQbYwmEuQXcGniubjf3fVGAJVIVAKeN8N2Wkuwxd2i8d9bLDHOZSYv772ZyYuiISUjSom5GU8ms9VqxAhjI05lyrQ5i3qMKpNwydX5v_kyGoSwJW3pFpJpL3p8QR9KqJaH8fQ2Xr9h_ARN-Ynx8oChKesqHlfhC31ZFfEUGgjYxHnt46ZlnqTxAqriAAVeRRc57AIOTr0fvd7N1pOHZPF8P5-MF4kTRDdJrqRxRNGMmCwTnAjFhKYbSlKJGlWqWe5yIrSkzMFGU2NEzpymDoVkQBXvR_PON6tha_e-fAf_bWso7S9Q-8KCb0q3Q2uMNKAhMwBCKEc3RgKmDlESzDaMt17Xndfe1x_Hl-22PviqPd8yxY3hzKSyZfGO5Xwdgsf8bysl9hiC7UKwxxDsKYRWNexUJSL-KUzaWgrJfwDWgoEL</recordid><startdate>2022</startdate><enddate>2022</enddate><creator>Kazemi, Arefeh</creator><creator>Mozafari, Jamshid</creator><creator>Nematbakhsh, Mohammad Ali</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-8643-9713</orcidid><orcidid>https://orcid.org/0000-0003-4850-9239</orcidid><orcidid>https://orcid.org/0000-0002-4374-9228</orcidid></search><sort><creationdate>2022</creationdate><title>PersianQuAD: The Native Question Answering Dataset for the Persian Language</title><author>Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Artificial intelligence</topic><topic>Buildings</topic><topic>Dataset</topic><topic>Datasets</topic><topic>Deep learning</topic><topic>Encyclopedias</topic><topic>Human performance</topic><topic>Internet</topic><topic>machine reading comprehension</topic><topic>Machine translation</topic><topic>natural language processing</topic><topic>Online services</topic><topic>Persian</topic><topic>Persian language</topic><topic>Quality assurance</topic><topic>question answering</topic><topic>Questions</topic><topic>Task analysis</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kazemi, Arefeh</creatorcontrib><creatorcontrib>Mozafari, Jamshid</creatorcontrib><creatorcontrib>Nematbakhsh, Mohammad Ali</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Xplore Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 
1998–Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Directory of Open Access Journals - May need to register for free articles</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kazemi, Arefeh</au><au>Mozafari, Jamshid</au><au>Nematbakhsh, Mohammad Ali</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>PersianQuAD: The Native Question Answering Dataset for the Persian Language</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2022</date><risdate>2022</risdate><volume>10</volume><spage>26045</spage><epage>26057</epage><pages>26045-26057</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>Developing Question Answering systems (QA) is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. 
In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. PersianQuAD and all QA models implemented in this paper are freely available.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2022.3157289</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0002-8643-9713</orcidid><orcidid>https://orcid.org/0000-0003-4850-9239</orcidid><orcidid>https://orcid.org/0000-0002-4374-9228</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 2169-3536
ispartof	IEEE access, 2022, Vol.10, p.26045-26057
issn	2169-3536 2169-3536
language	eng
recordid	cdi_crossref_primary_10_1109_ACCESS_2022_3157289
source	Directory of Open Access Journals - May need to register for free articles; IEEE Xplore Open Access Journals; EZB Electronic Journals Library
subjects	Artificial intelligence ; Buildings ; Dataset ; Datasets ; Deep learning ; Encyclopedias ; Human performance ; Internet ; machine reading comprehension ; Machine translation ; natural language processing ; Online services ; Persian ; Persian language ; Quality assurance ; question answering ; Questions ; Task analysis ; Training
title	PersianQuAD: The Native Question Answering Dataset for the Persian Language
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T15%3A22%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=PersianQuAD:%20The%20Native%20Question%20Answering%20Dataset%20for%20the%20Persian%20Language&rft.jtitle=IEEE%20access&rft.au=Kazemi,%20Arefeh&rft.date=2022&rft.volume=10&rft.spage=26045&rft.epage=26057&rft.pages=26045-26057&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2022.3157289&rft_dat=%3Cproquest_cross%3E2639932975%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2639932975&rft_id=info:pmid/&rft_ieee_id=9729745&rft_doaj_id=oai_doaj_org_article_9959a8ad9aa446c1b95ae7cee50edb23&rfr_iscdi=true