AUTHORSHIP ATTRIBUTION OF TELUGU TEXTS BASED ON SYNTACTIC FEATURES AND MACHINE LEARNING TECHNIQUES
The automatic recognition of an author of a document on the basis of linguistic features of the text is known as authorship attribution and the present paper performs this on one of the very popular and largely spoken languages of India "Telugu". The present paper strongly believes that ea...
Published in: | Journal of Theoretical and Applied Information Technology 2016-03, Vol.85 (1), p.95-95 |
---|---|
Main authors: | Ganapathi Raju, N V ; Kumar, V Vijay ; Rao, O Srinivasa |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 95 |
---|---|
container_issue | 1 |
container_start_page | 95 |
container_title | Journal of Theoretical and Applied Information Technology |
container_volume | 85 |
creator | Ganapathi Raju, N V ; Kumar, V Vijay ; Rao, O Srinivasa |
description | Authorship attribution is the automatic identification of a document's author from the linguistic features of its text. The present paper applies it to Telugu, one of the most widely spoken languages of India. It starts from the premise that each author has a unique writing style that serves as that author's signature. Authorship attribution resembles text categorization, but it relies on stylistic properties, which concern the form of linguistic expression rather than the content of a text. The present paper is based on "shallow" features such as function-word frequencies and part-of-speech (POS) tags. The experiments use a corpus of Telugu editorial articles written by different journalists. Token-based and lexical features are not considered because all the documents belong to a similar genre and these features are roughly constant across the different authors. The paper focuses on syntax-based (shallow) features of an author's style and, after a preprocessing step, evaluates the most frequently used syntactic N-gram POS tagging features (unigrams, bigrams and trigrams, with and without overlapping). It also computes authorship attribution using the Avyayas of Telugu (similar to stop words in English), and then combines the two feature sets (POS tags with Avyayas). Modern supervised machine learning algorithms are used to explore large feature vectors and achieve high attribution accuracy. An average attribution rate above 85% is achieved on all classifiers with different feature vectors. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_miscellaneous_1845795107</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>4195137041</sourcerecordid><originalsourceid>FETCH-LOGICAL-p131t-7420925b5d00518a9d0a42475f70ea3e288dfee831933e059a1aaa65069f62413</originalsourceid><addsrcrecordid>eNpdj0FPgzAAhXvQxDn9D028eCFpKaX02LECTbAotImelk5K4oJjjvH_10RPnt7hfe_lvRuwwhlmEcGc3oH7eT4glMYJpyuwF9ZUTdtV6hUKY1q1sUY1GjYFNLK2pQ3ybjq4EZ3cwmB0H9qI3KgcFlIY28oOCr2FLyKvlJawlqLVSpchlldavVnZPYDbwY2zf_zTNbCFNHkV1U2pclFHJ0zwJWJJjHhM97RHiOLM8R65JE4YHRjyjvg4y_rB-yycIMQjyh12zqUUpXwIZzBZg-ff3tN5-ln8fNl9f82ffhzd0U_LvMNZQhmnGLGAPv1DD9NyPoZ1gYoJYQxjQq6tb1Q7</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1823377113</pqid></control><display><type>article</type><title>AUTHORSHIP ATTRIBUTION OF TELUGU TEXTS BASED ON SYNTACTIC FEATURES AND MACHINE LEARNING TECHNIQUES</title><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><creator>Ganapathi Raju, N V ; Kumar, V Vijay ; Rao, O Srinivasa</creator><creatorcontrib>Ganapathi Raju, N V ; Kumar, V Vijay ; Rao, O Srinivasa</creatorcontrib><description>The automatic recognition of an author of a document on the basis of linguistic features of the text is known as authorship attribution and the present paper performs this on one of the very popular and largely spoken languages of India "Telugu". The present paper strongly believes that each author has got his own unique style of writing pattern, which is the signature of that author. The author attribution is similar to text categorization based on stylistic properties that deals with properties of the form of linguistic expression as opposed to the content of a text. The present paper is based on "shallow" features such as function words frequencies and part of speech (POS). 
The present paper experimented with a corpus that consists editorial articles of Telugu language by different journalists. The token and lexical based features are not considered because all the documents are in a similar genre and roughly constant over the different authors. The present paper focused on the use of syntax-based (shallow) features of an author's style, and evaluated most frequently used syntactic N-gram (unigram, bi-gram and tri-gram with and without overlapping) POS tagging features after performing the preprocessing step. The present paper also computed authorship attribution by considering Avyayas (similar to stop words in English language) of Telugu language. Further the present paper integrated the above two cases (POS tagging with Avyayas) in finding authorship attribution. Modern supervised machine learning algorithms are used by the present paper to explore large feature vectors to achieve high attribution accuracy. We have achieved an average of above 85% attribution rate on all classifiers with different feature vectors.</description><identifier>ISSN: 1817-3195</identifier><language>eng</language><publisher>Islamabad: Journal of Theoretical and Applied Information</publisher><subject>Algorithms ; Authoring ; Constants ; Languages ; Linguistics ; Machine learning ; Marking ; Texts</subject><ispartof>Journal of Theoretical and Applied Information Technology, 2016-03, Vol.85 (1), p.95-95</ispartof><rights>Copyright Journal of Theoretical and Applied Information Mar 2016</rights><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784</link.rule.ids></links><search><creatorcontrib>Ganapathi Raju, N V</creatorcontrib><creatorcontrib>Kumar, V Vijay</creatorcontrib><creatorcontrib>Rao, O Srinivasa</creatorcontrib><title>AUTHORSHIP ATTRIBUTION OF TELUGU TEXTS 
BASED ON SYNTACTIC FEATURES AND MACHINE LEARNING TECHNIQUES</title><title>Journal of Theoretical and Applied Information Technology</title><description>The automatic recognition of an author of a document on the basis of linguistic features of the text is known as authorship attribution and the present paper performs this on one of the very popular and largely spoken languages of India "Telugu". The present paper strongly believes that each author has got his own unique style of writing pattern, which is the signature of that author. The author attribution is similar to text categorization based on stylistic properties that deals with properties of the form of linguistic expression as opposed to the content of a text. The present paper is based on "shallow" features such as function words frequencies and part of speech (POS). The present paper experimented with a corpus that consists editorial articles of Telugu language by different journalists. The token and lexical based features are not considered because all the documents are in a similar genre and roughly constant over the different authors. The present paper focused on the use of syntax-based (shallow) features of an author's style, and evaluated most frequently used syntactic N-gram (unigram, bi-gram and tri-gram with and without overlapping) POS tagging features after performing the preprocessing step. The present paper also computed authorship attribution by considering Avyayas (similar to stop words in English language) of Telugu language. Further the present paper integrated the above two cases (POS tagging with Avyayas) in finding authorship attribution. Modern supervised machine learning algorithms are used by the present paper to explore large feature vectors to achieve high attribution accuracy. 
We have achieved an average of above 85% attribution rate on all classifiers with different feature vectors.</description><subject>Algorithms</subject><subject>Authoring</subject><subject>Constants</subject><subject>Languages</subject><subject>Linguistics</subject><subject>Machine learning</subject><subject>Marking</subject><subject>Texts</subject><issn>1817-3195</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2016</creationdate><recordtype>article</recordtype><recordid>eNpdj0FPgzAAhXvQxDn9D028eCFpKaX02LECTbAotImelk5K4oJjjvH_10RPnt7hfe_lvRuwwhlmEcGc3oH7eT4glMYJpyuwF9ZUTdtV6hUKY1q1sUY1GjYFNLK2pQ3ybjq4EZ3cwmB0H9qI3KgcFlIY28oOCr2FLyKvlJawlqLVSpchlldavVnZPYDbwY2zf_zTNbCFNHkV1U2pclFHJ0zwJWJJjHhM97RHiOLM8R65JE4YHRjyjvg4y_rB-yycIMQjyh12zqUUpXwIZzBZg-ff3tN5-ln8fNl9f82ffhzd0U_LvMNZQhmnGLGAPv1DD9NyPoZ1gYoJYQxjQq6tb1Q7</recordid><startdate>20160301</startdate><enddate>20160301</enddate><creator>Ganapathi Raju, N V</creator><creator>Kumar, V Vijay</creator><creator>Rao, O Srinivasa</creator><general>Journal of Theoretical and Applied Information</general><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20160301</creationdate><title>AUTHORSHIP ATTRIBUTION OF TELUGU TEXTS BASED ON SYNTACTIC FEATURES AND MACHINE LEARNING TECHNIQUES</title><author>Ganapathi Raju, N V ; Kumar, V Vijay ; Rao, O Srinivasa</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-p131t-7420925b5d00518a9d0a42475f70ea3e288dfee831933e059a1aaa65069f62413</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2016</creationdate><topic>Algorithms</topic><topic>Authoring</topic><topic>Constants</topic><topic>Languages</topic><topic>Linguistics</topic><topic>Machine learning</topic><topic>Marking</topic><topic>Texts</topic><toplevel>online_resources</toplevel><creatorcontrib>Ganapathi Raju, N 
V</creatorcontrib><creatorcontrib>Kumar, V Vijay</creatorcontrib><creatorcontrib>Rao, O Srinivasa</creatorcontrib><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Journal of Theoretical and Applied Information Technology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Ganapathi Raju, N V</au><au>Kumar, V Vijay</au><au>Rao, O Srinivasa</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>AUTHORSHIP ATTRIBUTION OF TELUGU TEXTS BASED ON SYNTACTIC FEATURES AND MACHINE LEARNING TECHNIQUES</atitle><jtitle>Journal of Theoretical and Applied Information Technology</jtitle><date>2016-03-01</date><risdate>2016</risdate><volume>85</volume><issue>1</issue><spage>95</spage><epage>95</epage><pages>95-95</pages><issn>1817-3195</issn><abstract>The automatic recognition of an author of a document on the basis of linguistic features of the text is known as authorship attribution and the present paper performs this on one of the very popular and largely spoken languages of India "Telugu". The present paper strongly believes that each author has got his own unique style of writing pattern, which is the signature of that author. The author attribution is similar to text categorization based on stylistic properties that deals with properties of the form of linguistic expression as opposed to the content of a text. The present paper is based on "shallow" features such as function words frequencies and part of speech (POS). 
The present paper experimented with a corpus that consists editorial articles of Telugu language by different journalists. The token and lexical based features are not considered because all the documents are in a similar genre and roughly constant over the different authors. The present paper focused on the use of syntax-based (shallow) features of an author's style, and evaluated most frequently used syntactic N-gram (unigram, bi-gram and tri-gram with and without overlapping) POS tagging features after performing the preprocessing step. The present paper also computed authorship attribution by considering Avyayas (similar to stop words in English language) of Telugu language. Further the present paper integrated the above two cases (POS tagging with Avyayas) in finding authorship attribution. Modern supervised machine learning algorithms are used by the present paper to explore large feature vectors to achieve high attribution accuracy. We have achieved an average of above 85% attribution rate on all classifiers with different feature vectors.</abstract><cop>Islamabad</cop><pub>Journal of Theoretical and Applied Information</pub><tpages>1</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1817-3195 |
ispartof | Journal of Theoretical and Applied Information Technology, 2016-03, Vol.85 (1), p.95-95 |
issn | 1817-3195 |
language | eng |
recordid | cdi_proquest_miscellaneous_1845795107 |
source | Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals |
subjects | Algorithms ; Authoring ; Constants ; Languages ; Linguistics ; Machine learning ; Marking ; Texts |
title | AUTHORSHIP ATTRIBUTION OF TELUGU TEXTS BASED ON SYNTACTIC FEATURES AND MACHINE LEARNING TECHNIQUES |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T16%3A55%3A00IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=AUTHORSHIP%20ATTRIBUTION%20OF%20TELUGU%20TEXTS%20BASED%20ON%20SYNTACTIC%20FEATURES%20AND%20MACHINE%20LEARNING%20TECHNIQUES&rft.jtitle=Journal%20of%20Theoretical%20and%20Applied%20Information%20Technology&rft.au=Ganapathi%20Raju,%20N%20V&rft.date=2016-03-01&rft.volume=85&rft.issue=1&rft.spage=95&rft.epage=95&rft.pages=95-95&rft.issn=1817-3195&rft_id=info:doi/&rft_dat=%3Cproquest%3E4195137041%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1823377113&rft_id=info:pmid/&rfr_iscdi=true |
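
The description above mentions POS N-gram features "with and without overlapping", combined with function-word (Avyaya) counts. The record does not give the authors' tag set, word list, or classifiers, so the following is only a minimal sketch of how such features could be built. The function names, the POS tags, and the word "mariyu" (used here purely as an illustrative function word) are assumptions, not taken from the paper.

```python
from collections import Counter


def pos_ngrams(tags, n, overlapping=True):
    """Return the POS-tag n-grams of a tagged document.

    overlapping=True slides the window one tag at a time;
    overlapping=False advances by n tags, so the windows do not overlap.
    """
    step = 1 if overlapping else n
    return [tuple(tags[i:i + n]) for i in range(0, len(tags) - n + 1, step)]


def feature_vector(tags, tokens, function_words,
                   n_values=(1, 2, 3), overlapping=True):
    """Count POS n-grams and function words in one document.

    The "pos:" and "fw:" key prefixes are a hypothetical naming scheme
    that keeps the two feature families apart in a single Counter.
    """
    features = Counter()
    for n in n_values:
        for gram in pos_ngrams(tags, n, overlapping):
            features["pos:" + "_".join(gram)] += 1
    for token in tokens:
        if token in function_words:
            features["fw:" + token] += 1
    return features


# Hypothetical tag sequence for a six-token document.
example_tags = ["NN", "VM", "CC", "NN", "VM", "SYM"]
overlapping_bigrams = pos_ngrams(example_tags, 2, overlapping=True)
disjoint_bigrams = pos_ngrams(example_tags, 2, overlapping=False)
print(len(overlapping_bigrams), len(disjoint_bigrams))
```

One such Counter per document could then be converted to a fixed-length vector and passed to any supervised classifier; which classifiers the authors used is not stated in this record.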