A Survey of Relevant Text Mining Technology
Recent advances in text mining and natural language processing technology have enabled researchers to detect an authors identity or demographic characteristics, such as age and gender, in several text genres by automatically analysing the variation of linguistic characteristics. However, applying su...
Gespeichert in:
Veröffentlicht in: | arXiv.org 2022-11 |
---|---|
Hauptverfasser: | , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Peersman, Claudia Edwards, Matthew Williams, Emma Rashid, Awais |
description | Recent advances in text mining and natural language processing technology have enabled researchers to detect an authors identity or demographic characteristics, such as age and gender, in several text genres by automatically analysing the variation of linguistic characteristics. However, applying such techniques in the wild, i.e., in both cybercriminal and regular online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied on social media communications typically has no control over the dataset size, the number of available communications will vary across users. Hence, the system has to be robust towards limited data availability. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant to a certain degree of linguistic noise (for example, abbreviations, non-standard language use, spelling variations and errors). Finally, in the context of cybercriminal fora, it has to be robust towards deceptive or adversarial behaviour, i.e. offenders who attempt to hide their criminal intentions (obfuscation) or who assume a false digital persona (imitation), potentially using coded language. In this work we present a comprehensive survey that discusses the problems that have already been addressed in current literature and review potential solutions. Additionally, we highlight which areas need to be given more attention. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2742868135</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2742868135</sourcerecordid><originalsourceid>FETCH-proquest_journals_27428681353</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mTQdlQILi0qS61UyE9TCErNSS1LzCtRCEmtKFHwzczLzEsHspMz8vJz8tMreRhY0xJzilN5oTQ3g7Kba4izh25BUX5haWpxSXxWfmlRHlAq3sjcxMjCzMLQ2NSYOFUAEXgw1A</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2742868135</pqid></control><display><type>article</type><title>A Survey of Relevant Text Mining Technology</title><source>Freely Accessible Journals</source><creator>Peersman, Claudia ; Edwards, Matthew ; Williams, Emma ; Rashid, Awais</creator><creatorcontrib>Peersman, Claudia ; Edwards, Matthew ; Williams, Emma ; Rashid, Awais</creatorcontrib><description>Recent advances in text mining and natural language processing technology have enabled researchers to detect an authors identity or demographic characteristics, such as age and gender, in several text genres by automatically analysing the variation of linguistic characteristics. However, applying such techniques in the wild, i.e., in both cybercriminal and regular online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied on social media communications typically has no control over the dataset size, the number of available communications will vary across users. Hence, the system has to be robust towards limited data availability. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant to a certain degree of linguistic noise (for example, abbreviations, non-standard language use, spelling variations and errors). Finally, in the context of cybercriminal fora, it has to be robust towards deceptive or adversarial behaviour, i.e. offenders who attempt to hide their criminal intentions (obfuscation) or who assume a false digital persona (imitation), potentially using coded language. In this work we present a comprehensive survey that discusses the problems that have already been addressed in current literature and review potential solutions. Additionally, we highlight which areas need to be given more attention.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Abbreviations ; Availability ; Crime ; Data mining ; Digital media ; Linguistics ; Literature reviews ; Natural language processing ; Robustness ; Social networks</subject><ispartof>arXiv.org, 2022-11</ispartof><rights>2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Peersman, Claudia</creatorcontrib><creatorcontrib>Edwards, Matthew</creatorcontrib><creatorcontrib>Williams, Emma</creatorcontrib><creatorcontrib>Rashid, Awais</creatorcontrib><title>A Survey of Relevant Text Mining Technology</title><title>arXiv.org</title><description>Recent advances in text mining and natural language processing technology have enabled researchers to detect an authors identity or demographic characteristics, such as age and gender, in several text genres by automatically analysing the variation of linguistic characteristics. However, applying such techniques in the wild, i.e., in both cybercriminal and regular online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied on social media communications typically has no control over the dataset size, the number of available communications will vary across users. Hence, the system has to be robust towards limited data availability. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant to a certain degree of linguistic noise (for example, abbreviations, non-standard language use, spelling variations and errors). Finally, in the context of cybercriminal fora, it has to be robust towards deceptive or adversarial behaviour, i.e. offenders who attempt to hide their criminal intentions (obfuscation) or who assume a false digital persona (imitation), potentially using coded language. In this work we present a comprehensive survey that discusses the problems that have already been addressed in current literature and review potential solutions. Additionally, we highlight which areas need to be given more attention.</description><subject>Abbreviations</subject><subject>Availability</subject><subject>Crime</subject><subject>Data mining</subject><subject>Digital media</subject><subject>Linguistics</subject><subject>Literature reviews</subject><subject>Natural language processing</subject><subject>Robustness</subject><subject>Social networks</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mTQdlQILi0qS61UyE9TCErNSS1LzCtRCEmtKFHwzczLzEsHspMz8vJz8tMreRhY0xJzilN5oTQ3g7Kba4izh25BUX5haWpxSXxWfmlRHlAq3sjcxMjCzMLQ2NSYOFUAEXgw1A</recordid><startdate>20221128</startdate><enddate>20221128</enddate><creator>Peersman, Claudia</creator><creator>Edwards, Matthew</creator><creator>Williams, Emma</creator><creator>Rashid, Awais</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20221128</creationdate><title>A Survey of Relevant Text Mining Technology</title><author>Peersman, Claudia ; Edwards, Matthew ; Williams, Emma ; Rashid, Awais</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_27428681353</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Abbreviations</topic><topic>Availability</topic><topic>Crime</topic><topic>Data mining</topic><topic>Digital media</topic><topic>Linguistics</topic><topic>Literature reviews</topic><topic>Natural language processing</topic><topic>Robustness</topic><topic>Social networks</topic><toplevel>online_resources</toplevel><creatorcontrib>Peersman, Claudia</creatorcontrib><creatorcontrib>Edwards, Matthew</creatorcontrib><creatorcontrib>Williams, Emma</creatorcontrib><creatorcontrib>Rashid, Awais</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Peersman, Claudia</au><au>Edwards, Matthew</au><au>Williams, Emma</au><au>Rashid, Awais</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>A Survey of Relevant Text Mining Technology</atitle><jtitle>arXiv.org</jtitle><date>2022-11-28</date><risdate>2022</risdate><eissn>2331-8422</eissn><abstract>Recent advances in text mining and natural language processing technology have enabled researchers to detect an authors identity or demographic characteristics, such as age and gender, in several text genres by automatically analysing the variation of linguistic characteristics. However, applying such techniques in the wild, i.e., in both cybercriminal and regular online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied on social media communications typically has no control over the dataset size, the number of available communications will vary across users. Hence, the system has to be robust towards limited data availability. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant to a certain degree of linguistic noise (for example, abbreviations, non-standard language use, spelling variations and errors). Finally, in the context of cybercriminal fora, it has to be robust towards deceptive or adversarial behaviour, i.e. offenders who attempt to hide their criminal intentions (obfuscation) or who assume a false digital persona (imitation), potentially using coded language. In this work we present a comprehensive survey that discusses the problems that have already been addressed in current literature and review potential solutions. Additionally, we highlight which areas need to be given more attention.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2022-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2742868135 |
source | Freely Accessible Journals |
subjects | Abbreviations Availability Crime Data mining Digital media Linguistics Literature reviews Natural language processing Robustness Social networks |
title | A Survey of Relevant Text Mining Technology |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-18T22%3A34%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20Survey%20of%20Relevant%20Text%20Mining%20Technology&rft.jtitle=arXiv.org&rft.au=Peersman,%20Claudia&rft.date=2022-11-28&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2742868135%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2742868135&rft_id=info:pmid/&rfr_iscdi=true |