Evaluating Author Attribution on Emirati Tweets
Author Attribution (AA) is a critical stylometry problem that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques...
Published in: | IEEE access 2020, Vol.8, p.149531-149543 |
---|---|
Main authors: | Khonji, Mahmoud; Iraqi, Youssef |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 149543 |
---|---|
container_issue | |
container_start_page | 149531 |
container_title | IEEE access |
container_volume | 8 |
creator | Khonji, Mahmoud; Iraqi, Youssef |
description | Author Attribution (AA) is a critical stylometry problem that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques in AA have not been assessed with Emirati social media e-texts. The reason is that no suitable dataset exists for evaluating AA techniques in this context. This paper introduces the Khonji-Iraqi Emirati Tweets Author Identification (AID) dataset with 30 authors (KIT-30), and detailed evaluations. Compound grams, a new definition of grams, are introduced, which allows us to achieve higher classification accuracy. Also, when the number of suspect authors increases, the classification accuracy degradation is not as severe as previously reported, when using suitable data representation. Furthermore, in order to work towards addressing the lack of conveniently-available implementations of stylometry methods, we have developed an extensive e-text feature extraction library, namely Fextractor, with a highly intuitive API. The library generalizes all existing n-gram-based feature extraction methods under the at least l-frequent, dir-directed, k-skipped n-grams, and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as Part of Speech (POS) tags, as well as lower-level ones, such as the distribution of function words and word shapes. |
doi_str_mv | 10.1109/ACCESS.2020.3016731 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 2169-3536 |
ispartof | IEEE access, 2020, Vol.8, p.149531-149543 |
issn | 2169-3536 2169-3536 |
language | eng |
recordid | cdi_ieee_primary_9167201 |
source | DOAJ (Directory of Open Access Journals); IEEE Xplore Open Access Journals; EZB Electronic Journals Library |
subjects | author identification; Classification; Compounds; Datasets; Digital media; Evaluation; Feature extraction; Forensics; Libraries; Radio frequency; Recommender systems; Stylometry; supervised learning; text analysis; Texts; unsupervised learning |
title | Evaluating Author Attribution on Emirati Tweets |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-01T12%3A05%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Evaluating%20Author%20Attribution%20on%20Emirati%20Tweets&rft.jtitle=IEEE%20access&rft.au=Khonji,%20Mahmoud&rft.date=2020&rft.volume=8&rft.spage=149531&rft.epage=149543&rft.pages=149531-149543&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2020.3016731&rft_dat=%3Cproquest_ieee_%3E2454643359%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2454643359&rft_id=info:pmid/&rft_ieee_id=9167201&rft_doaj_id=oai_doaj_org_article_8d08ac527b66472690c59edc793c0d95&rfr_iscdi=true |
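
The description above refers to "at least l-frequent, dir-directed, k-skipped n-grams" without defining the terms. The following is a minimal Python sketch of one common reading, not the Fextractor API: a k-skipped n-gram is taken to be an ordered selection of n tokens that skips at most k tokens in total, and "at least l-frequent" is taken to mean that a gram occurs at least l times across the corpus. The function names `skip_ngrams` and `frequent_skip_ngrams` are hypothetical, and the dir (direction) parameter is left out because the record does not say what it controls.

```python
from collections import Counter
from itertools import combinations


def skip_ngrams(tokens, n, k):
    """Yield every ordered n-gram that skips at most k tokens in total.

    Assumption: this follows the usual k-skip-n-gram definition; the
    paper's own definition may differ.
    """
    for start in range(len(tokens) - n + 1):
        # A gram anchored at `start` may span at most n + k positions.
        window_end = min(start + n + k, len(tokens))
        # Choose the remaining n - 1 positions to the right of `start`.
        for rest in combinations(range(start + 1, window_end), n - 1):
            yield (tokens[start],) + tuple(tokens[i] for i in rest)


def frequent_skip_ngrams(documents, n, k, l):
    """Count skip n-grams over all documents; keep those seen at least l times."""
    counts = Counter()
    for tokens in documents:
        counts.update(skip_ngrams(tokens, n, k))
    return {gram: count for gram, count in counts.items() if count >= l}


if __name__ == "__main__":
    # Illustrative toy corpus, not data from KIT-30.
    corpus = [["a", "b", "a", "b"], ["a", "b"]]
    print(frequent_skip_ngrams(corpus, n=2, k=0, l=2))
```

Setting k to 0 reduces the sketch to ordinary contiguous n-grams, and setting l to 1 disables the frequency filter, which illustrates how a single parameterized extractor can cover several familiar n-gram variants.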