Evaluating Author Attribution on Emirati Tweets

Author Attribution (AA) is a critical stylometry problem that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques...

Bibliographic Details
Published in:	IEEE access 2020, Vol.8, p.149531-149543
Main authors:	Khonji, Mahmoud; Iraqi, Youssef
Format:	Article
Language:	eng
Subjects:	author identification; Classification; Compounds; Datasets; Digital media; Evaluation; Feature extraction; Forensics; Libraries; Radio frequency; Recommender systems; Stylometry; supervised learning; text analysis; Texts; Twitter; unsupervised learning
Online access:	Full text

container_end_page	149543
container_issue
container_start_page	149531
container_title	IEEE access
container_volume	8
creator	Khonji, Mahmoud ; Iraqi, Youssef
description	Author Attribution (AA) is a critical stylometry problem that tries to deduce the identity of the authors of electronic texts (e-texts) by only examining the texts. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing techniques in AA have not been assessed with Emirati social media e-texts. The reason is that no suitable dataset exists for evaluating AA techniques in this context. This paper introduces the Khonji-Iraqi Emirati Tweets Author Identification (AID) dataset with 30 authors (KIT-30), and detailed evaluations. Compound grams, a new definition of grams, are introduced, which allows us to achieve higher classification accuracy. Also, when the number of suspect authors increases, the classification accuracy degradation is not as severe as previously reported, when using suitable data representation. Furthermore, in order to work towards addressing the lack of conveniently-available implementations of stylometry methods, we have developed an extensive e-text feature extraction library, namely Fextractor, with a highly intuitive API. The library generalizes all existing n-gram-based feature extraction methods under the at least l-frequent, dir-directed, k-skipped n-grams, and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as Part of Speech (POS) tags, as well as lower-level ones, such as the distribution of function words and word shapes.
doi_str_mv	10.1109/ACCESS.2020.3016731
format	Article
coden	IAECCG
eissn	2169-3536
genre	article
lds50	peer_reviewed
oa	free_for_read
orcid	https://orcid.org/0000-0003-0112-2600
publisher	Piscataway: IEEE
rights	Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020
tpages	13
fulltext	fulltext
identifier	ISSN: 2169-3536
ispartof	IEEE access, 2020, Vol.8, p.149531-149543
issn	2169-3536 2169-3536
language	eng
recordid	cdi_ieee_primary_9167201
source	DOAJ (Directory of Open Access Journals); IEEE Xplore Open Access Journals; EZB Electronic Journals Library
subjects	author identification; Classification; Compounds; Datasets; Digital media; Evaluation; Feature extraction; Forensics; Libraries; Radio frequency; Recommender systems; Stylometry; supervised learning; text analysis; Texts; Twitter; unsupervised learning
title	Evaluating Author Attribution on Emirati Tweets
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-01T12%3A05%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Evaluating%20Author%20Attribution%20on%20Emirati%20Tweets&rft.jtitle=IEEE%20access&rft.au=Khonji,%20Mahmoud&rft.date=2020&rft.volume=8&rft.spage=149531&rft.epage=149543&rft.pages=149531-149543&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2020.3016731&rft_dat=%3Cproquest_ieee_%3E2454643359%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2454643359&rft_id=info:pmid/&rft_ieee_id=9167201&rft_doaj_id=oai_doaj_org_article_8d08ac527b66472690c59edc793c0d95&rfr_iscdi=true
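
The description above says the library generalizes n-gram feature extraction under "at least l-frequent, dir-directed, k-skipped n-grams". As a rough illustration of the l-frequent and k-skipped parts only, here is a minimal Python sketch. It is not the Fextractor API: the function names are hypothetical, and it assumes that "k-skipped" means exactly k tokens are skipped between consecutive elements of a gram, which the record itself does not define. The dir (direction) parameter is not modeled, since the record gives no definition for it.

```python
from collections import Counter
from typing import Dict, Iterator, Sequence, Tuple

Gram = Tuple[str, ...]


def skipped_ngrams(tokens: Sequence[str], n: int, k: int) -> Iterator[Gram]:
    """Yield n-grams whose consecutive elements are exactly k tokens apart.

    Assumption (not taken from the record): k = 0 gives ordinary
    contiguous n-grams, k = 1 skips one token between picks, and so on.
    """
    if n < 1 or k < 0:
        raise ValueError("n must be >= 1 and k must be >= 0")
    step = k + 1
    span = (n - 1) * step + 1  # token positions covered by one gram
    for start in range(len(tokens) - span + 1):
        yield tuple(tokens[start + i * step] for i in range(n))


def frequent_skipped_ngrams(
    tokens: Sequence[str], n: int, k: int, l: int
) -> Dict[Gram, int]:
    """Count k-skipped n-grams and keep those occurring at least l times."""
    counts = Counter(skipped_ngrams(tokens, n, k))
    return {gram: c for gram, c in counts.items() if c >= l}


if __name__ == "__main__":
    sample = "a b a b a b".split()
    print(frequent_skipped_ngrams(sample, n=2, k=0, l=3))
    print(frequent_skipped_ngrams(sample, n=2, k=1, l=2))
```

Here the tokens are plain strings, but the same functions work on any sequence of labels, for example POS tags or word shapes, which is the kind of flexible gram definition the description mentions.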