PersianQuAD: The Native Question Answering Dataset for the Persian Language
Developing Question Answering systems (QA) is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Ma...
Published in: | IEEE access 2022, Vol.10, p.26045-26057 |
---|---|
Main authors: | Kazemi, Arefeh; Mozafari, Jamshid; Nematbakhsh, Mohammad Ali |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 26057 |
---|---|
container_issue | |
container_start_page | 26045 |
container_title | IEEE access |
container_volume | 10 |
creator | Kazemi, Arefeh; Mozafari, Jamshid; Nematbakhsh, Mohammad Ali |
description | Developing Question Answering (QA) systems is one of the main goals of Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have advanced significantly. Although DL performs very well on QA, it requires a considerable amount of annotated training data. Many annotated datasets have been built for the QA task, but most of them are exclusively in English. To address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) data quality monitoring. PersianQuAD consists of approximately 20,000 questions and answers written by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze the dataset and show that it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained on PersianQuAD. Our best system achieves an F1 score of 82.97%, which is comparable to that of QA systems on the English SQuAD dataset developed at Stanford University. This shows that PersianQuAD is well suited for training deep learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and that there is still plenty of room for improvement. PersianQuAD and all QA models implemented in this paper are freely available. |
doi_str_mv | 10.1109/ACCESS.2022.3157289 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1109_ACCESS_2022_3157289</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9729745</ieee_id><doaj_id>oai_doaj_org_article_9959a8ad9aa446c1b95ae7cee50edb23</doaj_id><sourcerecordid>2639932975</sourcerecordid><originalsourceid>FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</originalsourceid><addsrcrecordid>eNpNUU1PwkAQbYwmEuQXcGniubjf3fVGAJVIVAKeN8N2Wkuwxd2i8d9bLDHOZSYv772ZyYuiISUjSom5GU8ms9VqxAhjI05lyrQ5i3qMKpNwydX5v_kyGoSwJW3pFpJpL3p8QR9KqJaH8fQ2Xr9h_ARN-Ynx8oChKesqHlfhC31ZFfEUGgjYxHnt46ZlnqTxAqriAAVeRRc57AIOTr0fvd7N1pOHZPF8P5-MF4kTRDdJrqRxRNGMmCwTnAjFhKYbSlKJGlWqWe5yIrSkzMFGU2NEzpymDoVkQBXvR_PON6tha_e-fAf_bWso7S9Q-8KCb0q3Q2uMNKAhMwBCKEc3RgKmDlESzDaMt17Xndfe1x_Hl-22PviqPd8yxY3hzKSyZfGO5Xwdgsf8bysl9hiC7UKwxxDsKYRWNexUJSL-KUzaWgrJfwDWgoEL</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2639932975</pqid></control><display><type>article</type><title>PersianQuAD: The Native Question Answering Dataset for the Persian Language</title><source>Directory of Open Access Journals - May need to register for free articles</source><source>IEEE Xplore Open Access Journals</source><source>EZB Electronic Journals Library</source><creator>Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali</creator><creatorcontrib>Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali</creatorcontrib><description>Developing Question Answering systems (QA) is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. 
In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. 
PersianQuAD and all QA models implemented in this paper are freely available.</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2022.3157289</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Artificial intelligence ; Buildings ; Dataset ; Datasets ; Deep learning ; Encyclopedias ; Human performance ; Internet ; machine reading comprehension ; Machine translation ; natural language processing ; Online services ; Persian ; Persian language ; Quality assurance ; question answering ; Questions ; Task analysis ; Training</subject><ispartof>IEEE access, 2022, Vol.10, p.26045-26057</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</citedby><cites>FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</cites><orcidid>0000-0002-8643-9713 ; 0000-0003-4850-9239 ; 0000-0002-4374-9228</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9729745$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,776,780,860,2095,4009,27612,27902,27903,27904,54912</link.rule.ids></links><search><creatorcontrib>Kazemi, Arefeh</creatorcontrib><creatorcontrib>Mozafari, Jamshid</creatorcontrib><creatorcontrib>Nematbakhsh, Mohammad Ali</creatorcontrib><title>PersianQuAD: The Native Question Answering Dataset for the Persian Language</title><title>IEEE access</title><addtitle>Access</addtitle><description>Developing Question Answering systems (QA) is one of the main goals in Artificial 
Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. 
PersianQuAD and all QA models implemented in this paper are freely available.</description><subject>Artificial intelligence</subject><subject>Buildings</subject><subject>Dataset</subject><subject>Datasets</subject><subject>Deep learning</subject><subject>Encyclopedias</subject><subject>Human performance</subject><subject>Internet</subject><subject>machine reading comprehension</subject><subject>Machine translation</subject><subject>natural language processing</subject><subject>Online services</subject><subject>Persian</subject><subject>Persian language</subject><subject>Quality assurance</subject><subject>question answering</subject><subject>Questions</subject><subject>Task analysis</subject><subject>Training</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><sourceid>DOA</sourceid><recordid>eNpNUU1PwkAQbYwmEuQXcGniubjf3fVGAJVIVAKeN8N2Wkuwxd2i8d9bLDHOZSYv772ZyYuiISUjSom5GU8ms9VqxAhjI05lyrQ5i3qMKpNwydX5v_kyGoSwJW3pFpJpL3p8QR9KqJaH8fQ2Xr9h_ARN-Ynx8oChKesqHlfhC31ZFfEUGgjYxHnt46ZlnqTxAqriAAVeRRc57AIOTr0fvd7N1pOHZPF8P5-MF4kTRDdJrqRxRNGMmCwTnAjFhKYbSlKJGlWqWe5yIrSkzMFGU2NEzpymDoVkQBXvR_PON6tha_e-fAf_bWso7S9Q-8KCb0q3Q2uMNKAhMwBCKEc3RgKmDlESzDaMt17Xndfe1x_Hl-22PviqPd8yxY3hzKSyZfGO5Xwdgsf8bysl9hiC7UKwxxDsKYRWNexUJSL-KUzaWgrJfwDWgoEL</recordid><startdate>2022</startdate><enddate>2022</enddate><creator>Kazemi, Arefeh</creator><creator>Mozafari, Jamshid</creator><creator>Nematbakhsh, Mohammad Ali</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-8643-9713</orcidid><orcidid>https://orcid.org/0000-0003-4850-9239</orcidid><orcidid>https://orcid.org/0000-0002-4374-9228</orcidid></search><sort><creationdate>2022</creationdate><title>PersianQuAD: The Native Question Answering Dataset for the Persian Language</title><author>Kazemi, Arefeh ; Mozafari, Jamshid ; Nematbakhsh, Mohammad Ali</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c408t-f659c061d09dd430462481b1075e8e6782fcf048512cab81994f2c81ce452a163</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Artificial intelligence</topic><topic>Buildings</topic><topic>Dataset</topic><topic>Datasets</topic><topic>Deep learning</topic><topic>Encyclopedias</topic><topic>Human performance</topic><topic>Internet</topic><topic>machine reading comprehension</topic><topic>Machine translation</topic><topic>natural language processing</topic><topic>Online services</topic><topic>Persian</topic><topic>Persian language</topic><topic>Quality assurance</topic><topic>question answering</topic><topic>Questions</topic><topic>Task analysis</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kazemi, Arefeh</creatorcontrib><creatorcontrib>Mozafari, Jamshid</creatorcontrib><creatorcontrib>Nematbakhsh, Mohammad Ali</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Xplore Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 
1998–Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Directory of Open Access Journals - May need to register for free articles</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kazemi, Arefeh</au><au>Mozafari, Jamshid</au><au>Nematbakhsh, Mohammad Ali</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>PersianQuAD: The Native Question Answering Dataset for the Persian Language</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2022</date><risdate>2022</risdate><volume>10</volume><spage>26045</spage><epage>26057</epage><pages>26045-26057</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>Developing Question Answering systems (QA) is one of the main goals in Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have witnessed significant advances. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the QA task; most of them are exclusively in English. 
In order to address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidates test set preparation, and 4) Data Quality Monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze PersianQuAD and show it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained with PersianQuAD. Our best system achieves an F1 score of 82.97% which is comparable to that of QA systems on English SQuAD, made by the Stanford University. This shows that PersianQuAD performs well for training deep-learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and there is still plenty of room for future improvement. PersianQuAD and all QA models implemented in this paper are freely available.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2022.3157289</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0002-8643-9713</orcidid><orcidid>https://orcid.org/0000-0003-4850-9239</orcidid><orcidid>https://orcid.org/0000-0002-4374-9228</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2169-3536 |
ispartof | IEEE access, 2022, Vol.10, p.26045-26057 |
issn | 2169-3536 2169-3536 |
language | eng |
recordid | cdi_crossref_primary_10_1109_ACCESS_2022_3157289 |
source | Directory of Open Access Journals - May need to register for free articles; IEEE Xplore Open Access Journals; EZB Electronic Journals Library |
subjects | Artificial intelligence; Buildings; Dataset; Datasets; Deep learning; Encyclopedias; Human performance; Internet; machine reading comprehension; Machine translation; natural language processing; Online services; Persian; Persian language; Quality assurance; question answering; Questions; Task analysis; Training |
title | PersianQuAD: The Native Question Answering Dataset for the Persian Language |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T15%3A22%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=PersianQuAD:%20The%20Native%20Question%20Answering%20Dataset%20for%20the%20Persian%20Language&rft.jtitle=IEEE%20access&rft.au=Kazemi,%20Arefeh&rft.date=2022&rft.volume=10&rft.spage=26045&rft.epage=26057&rft.pages=26045-26057&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2022.3157289&rft_dat=%3Cproquest_cross%3E2639932975%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2639932975&rft_id=info:pmid/&rft_ieee_id=9729745&rft_doaj_id=oai_doaj_org_article_9959a8ad9aa446c1b95ae7cee50edb23&rfr_iscdi=true |