PersianQuAD: The Native Question Answering Dataset for the Persian Language

Developing Question Answering (QA) systems is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Ma...

Detailed description

Bibliographic details
Published in: IEEE Access, 2022, Vol. 10, p. 26045-26057
Main authors: Kazemi, Arefeh; Mozafari, Jamshid; Nematbakhsh, Mohammad Ali
Format: Article
Language: eng
Online access: Full text
container_end_page 26057
container_issue
container_start_page 26045
container_title IEEE access
container_volume 10
creator Kazemi, Arefeh
Mozafari, Jamshid
Nematbakhsh, Mohammad Ali
description Developing Question Answering (QA) systems is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task, but most of them are exclusively in English. To address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) data quality monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show that it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97%, which is comparable to that of QA systems on English SQuAD, created at Stanford University. This shows that PersianQuAD performs well for training deep learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and that there is still plenty of room for future improvement. PersianQuAD and all QA models implemented in this paper are freely available.
doi_str_mv 10.1109/ACCESS.2022.3157289
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1109_ACCESS_2022_3157289</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9729745</ieee_id><doaj_id>oai_doaj_org_article_9959a8ad9aa446c1b95ae7cee50edb23</doaj_id><sourcerecordid>2639932975</sourcerecordid><originalsourceid>FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</originalsourceid><addsrcrecordid>eNpNUU1PwkAQbYwmEuQXcGniubjf3fVGAJVIVAKeN8N2Wkuwxd2i8d9bLDHOZSYv772ZyYuiISUjSom5GU8ms9VqxAhjI05lyrQ5i3qMKpNwydX5v_kyGoSwJW3pFpJpL3p8QR9KqJaH8fQ2Xr9h_ARN-Ynx8oChKesqHlfhC31ZFfEUGgjYxHnt46ZlnqTxAqriAAVeRRc57AIOTr0fvd7N1pOHZPF8P5-MF4kTRDdJrqRxRNGMmCwTnAjFhKYbSlKJGlWqWe5yIrSkzMFGU2NEzpymDoVkQBXvR_PON6tha_e-fAf_bWso7S9Q-8KCb0q3Q2uMNKAhMwBCKEc3RgKmDlESzDaMt17Xndfe1x_Hl-22PviqPd8yxY3hzKSyZfGO5Xwdgsf8bysl9hiC7UKwxxDsKYRWNexUJSL-KUzaWgrJfwDWgoEL</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2639932975</pqid></control><display><type>article</type><title>PersianQuAD: The Native Question Answering Dataset for the Persian Language</title><source>Directory of Open Access Journals - May need to register for free articles</source><source>IEEE Xplore Open Access Journals</source><source>EZB Electronic Journals Library</source><creator>Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali</creator><creatorcontrib>Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali</creatorcontrib><description>Developing Question Answering systems (QA) is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. 
In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. 
PersianQuAD and all QA models implemented in this paper are freely available.</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2022.3157289</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Artificial intelligence ; Buildings ; Dataset ; Datasets ; Deep learning ; Encyclopedias ; Human performance ; Internet ; machine reading comprehension ; Machine translation ; natural language processing ; Online services ; Persian ; Persian language ; Quality assurance ; question answering ; Questions ; Task analysis ; Training</subject><ispartof>IEEE access, 2022, Vol.10, p.26045-26057</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</citedby><cites>FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</cites><orcidid>0000-0002-8643-9713 ; 0000-0003-4850-9239 ; 0000-0002-4374-9228</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9729745$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,776,780,860,2095,4009,27612,27902,27903,27904,54912</link.rule.ids></links><search><creatorcontrib>Kazemi, Arefeh</creatorcontrib><creatorcontrib>Mozafari, Jamshid</creatorcontrib><creatorcontrib>Nematbakhsh, Mohammad Ali</creatorcontrib><title>PersianQuAD: The Native Question Answering Dataset for the Persian Language</title><title>IEEE access</title><addtitle>Access</addtitle><description>Developing Question Answering systems (QA) is one of the main goals in Artificial 
Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. 
PersianQuAD and all QA models implemented in this paper are freely available.</description><subject>Artificial intelligence</subject><subject>Buildings</subject><subject>Dataset</subject><subject>Datasets</subject><subject>Deep learning</subject><subject>Encyclopedias</subject><subject>Human performance</subject><subject>Internet</subject><subject>machine reading comprehension</subject><subject>Machine translation</subject><subject>natural language processing</subject><subject>Online services</subject><subject>Persian</subject><subject>Persian language</subject><subject>Quality assurance</subject><subject>question answering</subject><subject>Questions</subject><subject>Task analysis</subject><subject>Training</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><sourceid>DOA</sourceid><recordid>eNpNUU1PwkAQbYwmEuQXcGniubjf3fVGAJVIVAKeN8N2Wkuwxd2i8d9bLDHOZSYv772ZyYuiISUjSom5GU8ms9VqxAhjI05lyrQ5i3qMKpNwydX5v_kyGoSwJW3pFpJpL3p8QR9KqJaH8fQ2Xr9h_ARN-Ynx8oChKesqHlfhC31ZFfEUGgjYxHnt46ZlnqTxAqriAAVeRRc57AIOTr0fvd7N1pOHZPF8P5-MF4kTRDdJrqRxRNGMmCwTnAjFhKYbSlKJGlWqWe5yIrSkzMFGU2NEzpymDoVkQBXvR_PON6tha_e-fAf_bWso7S9Q-8KCb0q3Q2uMNKAhMwBCKEc3RgKmDlESzDaMt17Xndfe1x_Hl-22PviqPd8yxY3hzKSyZfGO5Xwdgsf8bysl9hiC7UKwxxDsKYRWNexUJSL-KUzaWgrJfwDWgoEL</recordid><startdate>2022</startdate><enddate>2022</enddate><creator>Kazemi, Arefeh</creator><creator>Mozafari, Jamshid</creator><creator>Nematbakhsh, Mohammad Ali</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-8643-9713</orcidid><orcidid>https://orcid.org/0000-0003-4850-9239</orcidid><orcidid>https://orcid.org/0000-0002-4374-9228</orcidid></search><sort><creationdate>2022</creationdate><title>PersianQuAD: The Native Question Answering Dataset for the Persian Language</title><author>Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Artificial intelligence</topic><topic>Buildings</topic><topic>Dataset</topic><topic>Datasets</topic><topic>Deep learning</topic><topic>Encyclopedias</topic><topic>Human performance</topic><topic>Internet</topic><topic>machine reading comprehension</topic><topic>Machine translation</topic><topic>natural language processing</topic><topic>Online services</topic><topic>Persian</topic><topic>Persian language</topic><topic>Quality assurance</topic><topic>question answering</topic><topic>Questions</topic><topic>Task analysis</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kazemi, Arefeh</creatorcontrib><creatorcontrib>Mozafari, Jamshid</creatorcontrib><creatorcontrib>Nematbakhsh, Mohammad Ali</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Xplore Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 
1998–Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Directory of Open Access Journals - May need to register for free articles</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kazemi, Arefeh</au><au>Mozafari, Jamshid</au><au>Nematbakhsh, Mohammad Ali</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>PersianQuAD: The Native Question Answering Dataset for the Persian Language</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2022</date><risdate>2022</risdate><volume>10</volume><spage>26045</spage><epage>26057</epage><pages>26045-26057</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>Developing Question Answering systems (QA) is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. 
In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. PersianQuAD and all QA models implemented in this paper are freely available.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2022.3157289</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0002-8643-9713</orcidid><orcidid>https://orcid.org/0000-0003-4850-9239</orcidid><orcidid>https://orcid.org/0000-0002-4374-9228</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2022, Vol.10, p.26045-26057
issn 2169-3536
2169-3536
language eng
recordid cdi_crossref_primary_10_1109_ACCESS_2022_3157289
source Directory of Open Access Journals - May need to register for free articles; IEEE Xplore Open Access Journals; EZB Electronic Journals Library
subjects Artificial intelligence
Buildings
Dataset
Datasets
Deep learning
Encyclopedias
Human performance
Internet
machine reading comprehension
Machine translation
natural language processing
Online services
Persian
Persian language
Quality assurance
question answering
Questions
Task analysis
Training
title PersianQuAD: The Native Question Answering Dataset for the Persian Language
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T15%3A22%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=PersianQuAD:%20The%20Native%20Question%20Answering%20Dataset%20for%20the%20Persian%20Language&rft.jtitle=IEEE%20access&rft.au=Kazemi,%20Arefeh&rft.date=2022&rft.volume=10&rft.spage=26045&rft.epage=26057&rft.pages=26045-26057&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2022.3157289&rft_dat=%3Cproquest_cross%3E2639932975%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2639932975&rft_id=info:pmid/&rft_ieee_id=9729745&rft_doaj_id=oai_doaj_org_article_9959a8ad9aa446c1b95ae7cee50edb23&rfr_iscdi=true
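
The description above reports F1 scores for answers that are segments of the article text. As a rough illustration of how such a span-level F1 is commonly computed, the sketch below uses SQuAD-style token overlap. This is an assumption, not the authors' evaluation script: the record does not say how answers are tokenized or normalized, so plain whitespace splitting is used here, and the function names `token_f1` and `best_f1` are hypothetical. Scoring against several reference answers by taking the maximum is likewise only one plausible reading of the "three-candidates test set".

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer span.

    Assumption: tokens are obtained by whitespace splitting; a real
    evaluation would likely add language-specific normalization.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score a prediction against several references, keeping the best match."""
    return max(token_f1(prediction, ref) for ref in references)
```

A dataset-level figure would then be the mean of `best_f1` over all test questions, expressed as a percentage.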