Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges

The proposed work aims to explore and compare the potency of syntactic-semantic based linguistic structures in plagiarism detection using natural language processing techniques. The current work explores linguistic features, viz., part of speech tags, chunks and semantic roles in detecting plagiariz...

Bibliographic details
Published in:	Information processing & management 2018-05, Vol.54 (3), p.408-432
Main authors:	K., Vani ; Gupta, Deepa
Format:	Article
Language:	eng
Subjects:	Chunking ; Complexity ; Datasets ; Feature extraction ; Linguistics ; Natural language processing ; Plagiarism ; Plagiarism detection ; POS tagging ; Semantic role labelling ; Semantics ; Syntactic-semantic ; Systems analysis ; Technological change
Online access:	Full text

container_end_page	432
container_issue	3
container_start_page	408
container_title	Information processing & management
container_volume	54
creator	K., Vani ; Gupta, Deepa
description	The proposed work aims to explore and compare the potency of syntactic-semantic based linguistic structures in plagiarism detection using natural language processing techniques. The current work explores linguistic features, viz., part of speech tags, chunks and semantic roles, in detecting plagiarized fragments and utilizes a combined syntactic-semantic similarity metric, which extracts the semantic concepts from the WordNet lexical database. The linguistic information is utilized for effective pre-processing and for obtaining semantically relevant comparisons. Another major contribution is the analysis of the proposed approach on plagiarism cases of various complexity levels. The impact of plagiarism types and complexity levels upon the features extracted is analyzed and discussed. Further, unlike the existing systems, which were evaluated on limited data sets, the proposed approach is evaluated on a larger scale using the plagiarism corpus provided by the PAN competition (http://pan.webis.de) from 2009 to 2014. The approach presented considerable improvement in comparison with the top-ranked systems of the respective years. The evaluation and analysis with various cases of plagiarism also reflected the supremacy of deeper linguistic features for identifying manually plagiarized data.
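The abstract describes a similarity metric that combines syntactic information (part of speech tags) with semantic concepts drawn from WordNet, but this record does not give the formula. The following is only a minimal sketch of the general idea, not the authors' metric: it assumes sentences arrive already tagged with coarse POS labels, replaces WordNet with a tiny hypothetical synonym table (`TOY_SYNSETS`), and scores a sentence pair with a Dice-style coefficient over POS-compatible, semantically related content words. All names in it are illustrative assumptions.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[str, str]  # (lowercased word, coarse POS tag)

# Hypothetical stand-in for WordNet synsets; a real system would query WordNet.
TOY_SYNSETS: List[FrozenSet[str]] = [
    frozenset({"copy", "duplicate", "reproduce"}),
    frozenset({"text", "passage"}),
    frozenset({"author", "writer"}),
]

# Assumption: only content words take part in the comparison.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def related(a: str, b: str) -> bool:
    """True if the words are identical or share a toy synset."""
    if a == b:
        return True
    return any(a in synset and b in synset for synset in TOY_SYNSETS)


def content_tokens(sentence: List[Token]) -> List[Token]:
    """Pre-processing step: keep only tokens whose POS tag marks a content word."""
    return [tok for tok in sentence if tok[1] in CONTENT_TAGS]


def similarity(suspicious: List[Token], source: List[Token]) -> float:
    """Dice-style score over POS-matching, semantically related content words."""
    sus = content_tokens(suspicious)
    src = content_tokens(source)
    if not sus or not src:
        return 0.0
    unused = list(src)  # each source token may be matched at most once
    matches = 0
    for word, tag in sus:
        for cand in unused:
            # Syntactic check (same POS) plus semantic check (same or related word).
            if cand[1] == tag and related(word, cand[0]):
                matches += 1
                unused.remove(cand)
                break
    return 2 * matches / (len(sus) + len(src))


source_sentence = [("the", "DET"), ("author", "NOUN"), ("copy", "VERB"),
                   ("the", "DET"), ("text", "NOUN")]
paraphrase = [("the", "DET"), ("writer", "NOUN"), ("duplicate", "VERB"),
              ("the", "DET"), ("passage", "NOUN")]
unrelated = [("dog", "NOUN"), ("run", "VERB")]

print(similarity(paraphrase, source_sentence))
print(similarity(unrelated, source_sentence))
```

In this toy setting a synonym-substituted paraphrase scores as high as a verbatim copy, while a sentence with no related content words scores zero. Requiring the POS tags to agree before the semantic check is one simple way to keep comparisons "semantically relevant"; chunk and semantic-role features, which the abstract also mentions, are not modelled here.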
doi_str_mv	10.1016/j.ipm.2018.01.008
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2086367368</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0306457317300547</els_id><sourcerecordid>2086367368</sourcerecordid><originalsourceid>FETCH-LOGICAL-c373t-bc01effe5d3ad4fe6d974ec39256a25a706b45c4ba1cc7b0aa0268bf76c43b553</originalsourceid><addsrcrecordid>eNp9kM2O1DAQhC3ESgy7-wDcLHElwY5jOwsnNFp-pJW4sGer43SyHhInuBO08wC8N46GM6cutapKn4qxN1KUUkjz_lSGZSorIZtSyFKI5gU7yMaqQisrX7KDUMIUtbbqFXtNdBJC1FpWB_bnMU5AP0Mc-IrPK19GGAKkQBPfaP_SOa7g1-ALwgliFrwFwo5HWLcEIx8hDhsMyJc0eyS6VPmnGH5tSB_4cZ6WvXCO9I5DhPFMgbLouH-CccQ4IN2wqx5Gwtt_95o9fr7_cfxaPHz_8u346aHwyqq1aL2Q2PeoOwVd3aPp7myNXt1V2kClwQrT1trXLUjvbSsARGWatrfG16rVWl2zt5fezLrTre40bykzkatEY5SxyjTZJS8un2aihL1bUpggnZ0Ubl_bnVxe2-1rOyFdXjtnPl4ymPF_B0yOfMDosQsJ_eq6Ofwn_ReJzIwP</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2086367368</pqid></control><display><type>article</type><title>Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges</title><source>Elsevier ScienceDirect Journals Complete</source><creator>K., Vani ; Gupta, Deepa</creator><creatorcontrib>K., Vani ; Gupta, Deepa</creatorcontrib><description>The proposed work aims to explore and compare the potency of syntactic-semantic based linguistic structures in plagiarism detection using natural language processing techniques. The current work explores linguistic features, viz., part of speech tags, chunks and semantic roles in detecting plagiarized fragments and utilizes a combined syntactic-semantic similarity metric, which extracts the semantic concepts from WordNet lexical database. The linguistic information is utilized for effective pre-processing and for availing semantically relevant comparisons. 
Another major contribution is the analysis of the proposed approach on plagiarism cases of various complexity levels. The impact of plagiarism types and complexity levels, upon the features extracted is analyzed and discussed. Further, unlike the existing systems, which were evaluated on some limited data sets, the proposed approach is evaluated on a larger scale using the plagiarism corpus provided by PAN11http://pan.webis.de. competition from 2009 to 2014. The approach presented considerable improvement in comparison with the top-ranked systems of the respective years. The evaluation and analysis with various cases of plagiarism also reflected the supremacy of deeper linguistic features for identifying manually plagiarized data.</description><identifier>ISSN: 0306-4573</identifier><identifier>EISSN: 1873-5371</identifier><identifier>DOI: 10.1016/j.ipm.2018.01.008</identifier><language>eng</language><publisher>Oxford: Elsevier Ltd</publisher><subject>Chunking ; Complexity ; Datasets ; Feature extraction ; Linguistics ; Natural language processing ; Plagiarism ; Plagiarism detection ; POS tagging ; Semantic role labelling ; Semantics ; Syntactic-semantic ; Systems analysis ; Technological change</subject><ispartof>Information processing & management, 2018-05, Vol.54 (3), p.408-432</ispartof><rights>2018 Elsevier Ltd</rights><rights>Copyright Pergamon Press Inc. 
May 2018</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c373t-bc01effe5d3ad4fe6d974ec39256a25a706b45c4ba1cc7b0aa0268bf76c43b553</citedby><cites>FETCH-LOGICAL-c373t-bc01effe5d3ad4fe6d974ec39256a25a706b45c4ba1cc7b0aa0268bf76c43b553</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.ipm.2018.01.008$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,776,780,3536,27903,27904,45974</link.rule.ids></links><search><creatorcontrib>K., Vani</creatorcontrib><creatorcontrib>Gupta, Deepa</creatorcontrib><title>Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges</title><title>Information processing & management</title><description>The proposed work aims to explore and compare the potency of syntactic-semantic based linguistic structures in plagiarism detection using natural language processing techniques. The current work explores linguistic features, viz., part of speech tags, chunks and semantic roles in detecting plagiarized fragments and utilizes a combined syntactic-semantic similarity metric, which extracts the semantic concepts from WordNet lexical database. The linguistic information is utilized for effective pre-processing and for availing semantically relevant comparisons. Another major contribution is the analysis of the proposed approach on plagiarism cases of various complexity levels. The impact of plagiarism types and complexity levels, upon the features extracted is analyzed and discussed. Further, unlike the existing systems, which were evaluated on some limited data sets, the proposed approach is evaluated on a larger scale using the plagiarism corpus provided by PAN11http://pan.webis.de. competition from 2009 to 2014. 
The approach presented considerable improvement in comparison with the top-ranked systems of the respective years. The evaluation and analysis with various cases of plagiarism also reflected the supremacy of deeper linguistic features for identifying manually plagiarized data.</description><subject>Chunking</subject><subject>Complexity</subject><subject>Datasets</subject><subject>Feature extraction</subject><subject>Linguistics</subject><subject>Natural language processing</subject><subject>Plagiarism</subject><subject>Plagiarism detection</subject><subject>POS tagging</subject><subject>Semantic role labelling</subject><subject>Semantics</subject><subject>Syntactic-semantic</subject><subject>Systems analysis</subject><subject>Technological change</subject><issn>0306-4573</issn><issn>1873-5371</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><recordid>eNp9kM2O1DAQhC3ESgy7-wDcLHElwY5jOwsnNFp-pJW4sGer43SyHhInuBO08wC8N46GM6cutapKn4qxN1KUUkjz_lSGZSorIZtSyFKI5gU7yMaqQisrX7KDUMIUtbbqFXtNdBJC1FpWB_bnMU5AP0Mc-IrPK19GGAKkQBPfaP_SOa7g1-ALwgliFrwFwo5HWLcEIx8hDhsMyJc0eyS6VPmnGH5tSB_4cZ6WvXCO9I5DhPFMgbLouH-CccQ4IN2wqx5Gwtt_95o9fr7_cfxaPHz_8u346aHwyqq1aL2Q2PeoOwVd3aPp7myNXt1V2kClwQrT1trXLUjvbSsARGWatrfG16rVWl2zt5fezLrTre40bykzkatEY5SxyjTZJS8un2aihL1bUpggnZ0Ubl_bnVxe2-1rOyFdXjtnPl4ymPF_B0yOfMDosQsJ_eq6Ofwn_ReJzIwP</recordid><startdate>20180501</startdate><enddate>20180501</enddate><creator>K., Vani</creator><creator>Gupta, Deepa</creator><general>Elsevier Ltd</general><general>Elsevier Science Ltd</general><scope>AAYXX</scope><scope>CITATION</scope><scope>E3H</scope><scope>F2A</scope></search><sort><creationdate>20180501</creationdate><title>Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges</title><author>K., Vani ; Gupta, 
Deepa</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c373t-bc01effe5d3ad4fe6d974ec39256a25a706b45c4ba1cc7b0aa0268bf76c43b553</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Chunking</topic><topic>Complexity</topic><topic>Datasets</topic><topic>Feature extraction</topic><topic>Linguistics</topic><topic>Natural language processing</topic><topic>Plagiarism</topic><topic>Plagiarism detection</topic><topic>POS tagging</topic><topic>Semantic role labelling</topic><topic>Semantics</topic><topic>Syntactic-semantic</topic><topic>Systems analysis</topic><topic>Technological change</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>K., Vani</creatorcontrib><creatorcontrib>Gupta, Deepa</creatorcontrib><collection>CrossRef</collection><collection>Library & Information Sciences Abstracts (LISA)</collection><collection>Library & Information Science Abstracts (LISA)</collection><jtitle>Information processing & management</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>K., Vani</au><au>Gupta, Deepa</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges</atitle><jtitle>Information processing & management</jtitle><date>2018-05-01</date><risdate>2018</risdate><volume>54</volume><issue>3</issue><spage>408</spage><epage>432</epage><pages>408-432</pages><issn>0306-4573</issn><eissn>1873-5371</eissn><abstract>The proposed work aims to explore and compare the potency of syntactic-semantic based linguistic structures in plagiarism detection using natural language processing techniques. 
The current work explores linguistic features, viz., part of speech tags, chunks and semantic roles in detecting plagiarized fragments and utilizes a combined syntactic-semantic similarity metric, which extracts the semantic concepts from WordNet lexical database. The linguistic information is utilized for effective pre-processing and for availing semantically relevant comparisons. Another major contribution is the analysis of the proposed approach on plagiarism cases of various complexity levels. The impact of plagiarism types and complexity levels, upon the features extracted is analyzed and discussed. Further, unlike the existing systems, which were evaluated on some limited data sets, the proposed approach is evaluated on a larger scale using the plagiarism corpus provided by PAN11http://pan.webis.de. competition from 2009 to 2014. The approach presented considerable improvement in comparison with the top-ranked systems of the respective years. The evaluation and analysis with various cases of plagiarism also reflected the supremacy of deeper linguistic features for identifying manually plagiarized data.</abstract><cop>Oxford</cop><pub>Elsevier Ltd</pub><doi>10.1016/j.ipm.2018.01.008</doi><tpages>25</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0306-4573
ispartof	Information processing & management, 2018-05, Vol.54 (3), p.408-432
issn	0306-4573 1873-5371
language	eng
recordid	cdi_proquest_journals_2086367368
source	Elsevier ScienceDirect Journals Complete
subjects	Chunking ; Complexity ; Datasets ; Feature extraction ; Linguistics ; Natural language processing ; Plagiarism ; Plagiarism detection ; POS tagging ; Semantic role labelling ; Semantics ; Syntactic-semantic ; Systems analysis ; Technological change
title	Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T01%3A35%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unmasking%20text%20plagiarism%20using%20syntactic-semantic%20based%20natural%20language%20processing%20techniques:%20Comparisons,%20analysis%20and%20challenges&rft.jtitle=Information%20processing%20&%20management&rft.au=K.,%20Vani&rft.date=2018-05-01&rft.volume=54&rft.issue=3&rft.spage=408&rft.epage=432&rft.pages=408-432&rft.issn=0306-4573&rft.eissn=1873-5371&rft_id=info:doi/10.1016/j.ipm.2018.01.008&rft_dat=%3Cproquest_cross%3E2086367368%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2086367368&rft_id=info:pmid/&rft_els_id=S0306457317300547&rfr_iscdi=true