AUTHORSHIP ATTRIBUTION OF TELUGU TEXTS BASED ON SYNTACTIC FEATURES AND MACHINE LEARNING TECHNIQUES

Bibliographic details
Published in:	Journal of Theoretical and Applied Information Technology 2016-03, Vol.85 (1), p.95-95
Main authors:	Ganapathi Raju, N V, Kumar, V Vijay, Rao, O Srinivasa
Format:	Article
Language:	English
Subjects:	Algorithms; Authoring; Constants; Languages; Linguistics; Machine learning; Marking; Texts
Online access:	Full text

Description
Abstract:	Authorship attribution is the automatic identification of a document's author from linguistic features of the text. This paper applies it to Telugu, one of the most widely spoken languages of India. The underlying premise is that each author has a unique writing style that serves as that author's signature. Authorship attribution resembles text categorization based on stylistic properties, which concern the form of linguistic expression rather than the content of a text. The study relies on "shallow" features such as function-word frequencies and part-of-speech (POS) tags. Experiments use a corpus of Telugu editorial articles written by different journalists. Token-based and lexical features are not considered, because all the documents belong to a similar genre and such features are roughly constant across authors. The focus is on syntax-based (shallow) features of an author's style: after a preprocessing step, the most frequently used POS-tag N-grams (unigrams, bigrams, and trigrams, with and without overlapping) are evaluated. Authorship attribution is also computed using the Avyayas of Telugu (similar to stop words in English). The two feature sets (POS tags and Avyayas) are then combined for attribution. Modern supervised machine learning algorithms are used to explore large feature vectors and achieve high attribution accuracy. An average attribution rate above 85% is reported across all classifiers with the different feature vectors.
ISSN:	1817-3195
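
The feature types named in the abstract (POS-tag N-grams with and without overlapping, Avyaya frequencies, and their combination) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the placeholder tags and tokens, and the reading of "without overlapping" as a window that advances by n positions are all assumptions made for illustration; the record does not say which tagger, tag set, or Avyaya list the paper used.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Optional, Sequence, Tuple


def pos_ngrams(tags: Sequence[str], n: int, overlapping: bool = True) -> List[Tuple[str, ...]]:
    """Return the POS-tag n-grams of a tag sequence.

    Assumption: "overlapping" slides the window by 1 tag, and
    "without overlapping" slides it by n tags.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    step = 1 if overlapping else n
    return [tuple(tags[i:i + n]) for i in range(0, len(tags) - n + 1, step)]


def relative_frequencies(items: Iterable[Hashable],
                         vocabulary: Optional[Iterable[Hashable]] = None) -> Dict[Hashable, float]:
    """Count items and divide by the total number of items.

    If a vocabulary is given, only those keys are reported
    (with 0.0 for keys that never occur).
    """
    items = list(items)
    if not items:
        return {}
    counts = Counter(items)
    keys = list(vocabulary) if vocabulary is not None else list(counts)
    return {key: counts[key] / len(items) for key in keys}


def combined_features(tokens: Sequence[str], tags: Sequence[str],
                      avyayas: Sequence[str], n: int = 2,
                      overlapping: bool = True) -> Dict[Tuple[str, Hashable], float]:
    """Merge POS n-gram frequencies and Avyaya frequencies into one feature dict."""
    features: Dict[Tuple[str, Hashable], float] = {}
    for gram, freq in relative_frequencies(pos_ngrams(tags, n, overlapping)).items():
        features[("pos", gram)] = freq
    for word, freq in relative_frequencies(tokens, vocabulary=avyayas).items():
        features[("avyaya", word)] = freq
    return features


if __name__ == "__main__":
    # Placeholder data: hypothetical tags and tokens, not real Telugu output.
    tags = ["NN", "VB", "NN", "VB", "NN"]
    tokens = ["fw_a", "x", "fw_b", "y", "fw_a"]
    print(pos_ngrams(tags, 2))                     # overlapping bigrams
    print(pos_ngrams(tags, 2, overlapping=False))  # non-overlapping bigrams
    print(combined_features(tokens, tags, avyayas=["fw_a", "fw_b"]))
```

One feature dictionary per document could then be vectorized and passed to any supervised classifier. The abstract does not name the classifiers used, so that step is left out here.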