Evaluating Author Attribution on Emirati Tweets

Author Attribution (AA) is a critical stylometry problem that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques...

Bibliographic details
Published in: IEEE access 2020, Vol.8, p.149531-149543
Main authors: Khonji, Mahmoud; Iraqi, Youssef
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 149543
container_issue
container_start_page 149531
container_title IEEE access
container_volume 8
creator Khonji, Mahmoud
Iraqi, Youssef
description Author Attribution (AA) is a critical stylometry problem that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques in AA have not been assessed with Emirati social media e-texts. The reason is that no suitable dataset exists for evaluating AA techniques in this context. This paper introduces the Khonji-Iraqi Emirati Tweets Author Identification (AID) dataset with 30 authors (KIT-30), and detailed evaluations. Compound grams, a new definition of grams, are introduced, which allows us to achieve higher classification accuracy. Also, when the number of suspect authors increases, the classification accuracy degradation is not as severe as previously reported, when using suitable data representation. Furthermore, in order to work towards addressing the lack of conveniently-available implementations of stylometry methods, we have developed an extensive e-text feature extraction library, namely Fextractor, with a highly intuitive API. The library generalizes all existing n-gram-based feature extraction methods under the at least l-frequent, dir-directed, k-skipped n-grams, and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as Part of Speech (POS) tags, as well as lower-level ones, such as the distribution of function words and word shapes.
doi_str_mv 10.1109/ACCESS.2020.3016731
format Article
fullrecord <record><control><sourceid>proquest_ieee_</sourceid><recordid>TN_cdi_ieee_primary_9167201</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9167201</ieee_id><doaj_id>oai_doaj_org_article_8d08ac527b66472690c59edc793c0d95</doaj_id><sourcerecordid>2454643359</sourcerecordid><originalsourceid>FETCH-LOGICAL-c408t-112b8e46c98aa9ffd504da6bacefc4b9793e2b56c159b3d2b4a42fca79953def3</originalsourceid><addsrcrecordid>eNpNUE1Lw0AQXUTBUvsLegl4TrvfyR5DiFooeGg9L_uVmtJ262aj-O_dmiIOAzPMzHtveADMEVwgBMWyqutms1lgiOGCQMQLgm7ABCMucsIIv_3X34NZ3-9hijKNWDEBy-ZTHQYVu9Muq4b47kNWxRg6PcTOn7KUzbELaZ9tv5yL_QO4a9Whd7NrnYK3p2Zbv-Tr1-dVXa1zQ2EZc4SwLh3lRpRKiba1DFKruFbGtYZqUQjisGbcICY0sVhTRXFrVCEEI9a1ZApWI6_1ai_PoTuq8C296uTvwIedVCF25uBkaWGpDMOF5pwWmAtomHDWJA0DbeKbgseR6xz8x-D6KPd-CKf0vsSUUU4JYSJdkfHKBN_3wbV_qgjKi9FyNFpejJZXoxNqPqI659wfQqQlhoj8ABFNeHg</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2454643359</pqid></control><display><type>article</type><title>Evaluating Author Attribution on Emirati Tweets</title><source>DOAJ (Directory of Open Access Journals)</source><source>IEEE Xplore Open Access Journals</source><source>EZB Electronic Journals Library</source><creator>Khonji, Mahmoud ; Iraqi, Youssef</creator><creatorcontrib>Khonji, Mahmoud ; Iraqi, Youssef</creatorcontrib><description><![CDATA[Author Attribution (AA) is a critical stylometry problem that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques in AA have not been assessed with Emirati social media e-texts. The reason is that no suitable dataset exists for evaluating AA techniques in this context. This paper introduces the Khonji-Iraqi Emirati Tweets Author Identification (AID) dataset with 30 authors (KIT-30), and detailed evaluations. 
Compound grams, a new definition of grams, are introduced, which allows us to achieve higher classification accuracy. Also, when the number of suspect authors increases, the classification accuracy degradation is not as severe as previously reported, when using suitable data representation. Furthermore, in order to work towards addressing the lack of conveniently-available implementations of stylometry methods, we have developed an extensive e-text feature extraction library, namely Fextractor , with a highly intuitive API. The library generalizes all existing <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula>-gram-based feature extraction methods under the at least <inline-formula> <tex-math notation="LaTeX">l </tex-math></inline-formula> -frequent, <inline-formula> <tex-math notation="LaTeX">\texttt {dir} </tex-math></inline-formula> -directed, <inline-formula> <tex-math notation="LaTeX">k </tex-math></inline-formula> -skipped <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula> -grams , and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as Part of Speech (POS) tags, as well as lower-level ones, such as the distribution of function words and word shapes.]]></description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2020.3016731</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>author identification ; Classification ; Compounds ; Datasets ; Digital media ; Evaluation ; Feature extraction ; Forensics ; Libraries ; Radio frequency ; Recommender systems ; Stylometry ; supervised learning ; text analysis ; Texts ; Twitter ; unsupervised learning</subject><ispartof>IEEE access, 2020, Vol.8, p.149531-149543</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2020</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c408t-112b8e46c98aa9ffd504da6bacefc4b9793e2b56c159b3d2b4a42fca79953def3</citedby><cites>FETCH-LOGICAL-c408t-112b8e46c98aa9ffd504da6bacefc4b9793e2b56c159b3d2b4a42fca79953def3</cites><orcidid>0000-0003-0112-2600</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9167201$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,776,780,860,2096,4010,27610,27900,27901,27902,54908</link.rule.ids></links><search><creatorcontrib>Khonji, Mahmoud</creatorcontrib><creatorcontrib>Iraqi, Youssef</creatorcontrib><title>Evaluating Author Attribution on Emirati Tweets</title><title>IEEE access</title><addtitle>Access</addtitle><description><![CDATA[Author Attribution (AA) is a critical stylometry problem that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques in AA have not been assessed with Emirati social media e-texts. The reason is that no suitable dataset exists for evaluating AA techniques in this context. This paper introduces the Khonji-Iraqi Emirati Tweets Author Identification (AID) dataset with 30 authors (KIT-30), and detailed evaluations. Compound grams, a new definition of grams, are introduced, which allows us to achieve higher classification accuracy. Also, when the number of suspect authors increases, the classification accuracy degradation is not as severe as previously reported, when using suitable data representation. 
Furthermore, in order to work towards addressing the lack of conveniently-available implementations of stylometry methods, we have developed an extensive e-text feature extraction library, namely Fextractor , with a highly intuitive API. The library generalizes all existing <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula>-gram-based feature extraction methods under the at least <inline-formula> <tex-math notation="LaTeX">l </tex-math></inline-formula> -frequent, <inline-formula> <tex-math notation="LaTeX">\texttt {dir} </tex-math></inline-formula> -directed, <inline-formula> <tex-math notation="LaTeX">k </tex-math></inline-formula> -skipped <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula> -grams , and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as Part of Speech (POS) tags, as well as lower-level ones, such as the distribution of function words and word shapes.]]></description><subject>author identification</subject><subject>Classification</subject><subject>Compounds</subject><subject>Datasets</subject><subject>Digital media</subject><subject>Evaluation</subject><subject>Feature extraction</subject><subject>Forensics</subject><subject>Libraries</subject><subject>Radio frequency</subject><subject>Recommender systems</subject><subject>Stylometry</subject><subject>supervised learning</subject><subject>text analysis</subject><subject>Texts</subject><subject>Twitter</subject><subject>unsupervised 
learning</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><sourceid>DOA</sourceid><recordid>eNpNUE1Lw0AQXUTBUvsLegl4TrvfyR5DiFooeGg9L_uVmtJ262aj-O_dmiIOAzPMzHtveADMEVwgBMWyqutms1lgiOGCQMQLgm7ABCMucsIIv_3X34NZ3-9hijKNWDEBy-ZTHQYVu9Muq4b47kNWxRg6PcTOn7KUzbELaZ9tv5yL_QO4a9Whd7NrnYK3p2Zbv-Tr1-dVXa1zQ2EZc4SwLh3lRpRKiba1DFKruFbGtYZqUQjisGbcICY0sVhTRXFrVCEEI9a1ZApWI6_1ai_PoTuq8C296uTvwIedVCF25uBkaWGpDMOF5pwWmAtomHDWJA0DbeKbgseR6xz8x-D6KPd-CKf0vsSUUU4JYSJdkfHKBN_3wbV_qgjKi9FyNFpejJZXoxNqPqI659wfQqQlhoj8ABFNeHg</recordid><startdate>2020</startdate><enddate>2020</enddate><creator>Khonji, Mahmoud</creator><creator>Iraqi, Youssef</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-0112-2600</orcidid></search><sort><creationdate>2020</creationdate><title>Evaluating Author Attribution on Emirati Tweets</title><author>Khonji, Mahmoud ; Iraqi, Youssef</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c408t-112b8e46c98aa9ffd504da6bacefc4b9793e2b56c159b3d2b4a42fca79953def3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>author identification</topic><topic>Classification</topic><topic>Compounds</topic><topic>Datasets</topic><topic>Digital media</topic><topic>Evaluation</topic><topic>Feature extraction</topic><topic>Forensics</topic><topic>Libraries</topic><topic>Radio 
frequency</topic><topic>Recommender systems</topic><topic>Stylometry</topic><topic>supervised learning</topic><topic>text analysis</topic><topic>Texts</topic><topic>Twitter</topic><topic>unsupervised learning</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Khonji, Mahmoud</creatorcontrib><creatorcontrib>Iraqi, Youssef</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Xplore Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998–Present</collection><collection>IEEE Xplore</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>DOAJ (Directory of Open Access Journals)</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Khonji, Mahmoud</au><au>Iraqi, Youssef</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Evaluating Author Attribution on Emirati Tweets</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2020</date><risdate>2020</risdate><volume>8</volume><spage>149531</spage><epage>149543</epage><pages>149531-149543</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract><![CDATA[Author Attribution (AA) is a critical stylometry problem 
that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques in AA have not been assessed with Emirati social media e-texts. The reason is that no suitable dataset exists for evaluating AA techniques in this context. This paper introduces the Khonji-Iraqi Emirati Tweets Author Identification (AID) dataset with 30 authors (KIT-30), and detailed evaluations. Compound grams, a new definition of grams, are introduced, which allows us to achieve higher classification accuracy. Also, when the number of suspect authors increases, the classification accuracy degradation is not as severe as previously reported, when using suitable data representation. Furthermore, in order to work towards addressing the lack of conveniently-available implementations of stylometry methods, we have developed an extensive e-text feature extraction library, namely Fextractor , with a highly intuitive API. 
The library generalizes all existing <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula>-gram-based feature extraction methods under the at least <inline-formula> <tex-math notation="LaTeX">l </tex-math></inline-formula> -frequent, <inline-formula> <tex-math notation="LaTeX">\texttt {dir} </tex-math></inline-formula> -directed, <inline-formula> <tex-math notation="LaTeX">k </tex-math></inline-formula> -skipped <inline-formula> <tex-math notation="LaTeX">n </tex-math></inline-formula> -grams , and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as Part of Speech (POS) tags, as well as lower-level ones, such as the distribution of function words and word shapes.]]></abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2020.3016731</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0003-0112-2600</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2020, Vol.8, p.149531-149543
issn 2169-3536
2169-3536
language eng
recordid cdi_ieee_primary_9167201
source DOAJ (Directory of Open Access Journals); IEEE Xplore Open Access Journals; EZB Electronic Journals Library
subjects author identification
Classification
Compounds
Datasets
Digital media
Evaluation
Feature extraction
Forensics
Libraries
Radio frequency
Recommender systems
Stylometry
supervised learning
text analysis
Texts
Twitter
unsupervised learning
title Evaluating Author Attribution on Emirati Tweets
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-01T12%3A05%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Evaluating%20Author%20Attribution%20on%20Emirati%20Tweets&rft.jtitle=IEEE%20access&rft.au=Khonji,%20Mahmoud&rft.date=2020&rft.volume=8&rft.spage=149531&rft.epage=149543&rft.pages=149531-149543&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2020.3016731&rft_dat=%3Cproquest_ieee_%3E2454643359%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2454643359&rft_id=info:pmid/&rft_ieee_id=9167201&rft_doaj_id=oai_doaj_org_article_8d08ac527b66472690c59edc793c0d95&rfr_iscdi=true