Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges

The proposed work aims to explore and compare the potency of syntactic-semantic based linguistic structures in plagiarism detection using natural language processing techniques. The current work explores linguistic features, viz., part of speech tags, chunks and semantic roles in detecting plagiariz...
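
The combined syntactic-semantic comparison named in the abstract can be illustrated with a minimal sketch. This is not the authors' metric: the record does not give its definition. The sketch assumes that words are compared only when their part-of-speech tags agree, and that two words match when they are identical or share a concept. The concept table `TOY_CONCEPTS`, the function names and the hand-tagged sentences are all hypothetical; the small dictionary merely stands in for a WordNet lookup.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical stand-in for a WordNet lookup: word -> set of concept labels.
# The labels are invented for illustration and are not real WordNet synsets.
TOY_CONCEPTS: Dict[str, FrozenSet[str]] = {
    "car": frozenset({"concept:vehicle"}),
    "automobile": frozenset({"concept:vehicle"}),
    "buy": frozenset({"concept:purchase"}),
    "purchase": frozenset({"concept:purchase"}),
}

# A sentence is a list of (lemma, coarse POS tag) pairs, tagged by hand here.
Tagged = List[Tuple[str, str]]


def words_match(a: str, b: str) -> bool:
    """True if the lemmas are identical or share at least one toy concept."""
    if a == b:
        return True
    shared = TOY_CONCEPTS.get(a, frozenset()) & TOY_CONCEPTS.get(b, frozenset())
    return bool(shared)


def pos_aware_overlap(source: Tagged, suspicious: Tagged) -> float:
    """Fraction of suspicious words matched by a source word with the same POS tag."""
    if not suspicious:
        return 0.0
    matched = 0
    for word, tag in suspicious:
        # Syntactic filter first (same tag), then the semantic comparison.
        if any(tag == s_tag and words_match(word, s_word) for s_word, s_tag in source):
            matched += 1
    return matched / len(suspicious)


source = [("she", "PRON"), ("buy", "VERB"), ("car", "NOUN")]
suspicious = [("she", "PRON"), ("purchase", "VERB"), ("automobile", "NOUN")]
print(pos_aware_overlap(source, suspicious))  # 1.0: every word matches under the toy table
```

Filtering on the tag before consulting the concept table keeps a noun from being matched against a verb that happens to share a spelling, which is one plausible reading of how syntactic information can make the semantic comparison more relevant.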

Bibliographic details
Published in: Information processing & management 2018-05, Vol.54 (3), p.408-432
Main authors: K., Vani; Gupta, Deepa
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 432
container_issue 3
container_start_page 408
container_title Information processing & management
container_volume 54
creator K., Vani
Gupta, Deepa
description This work explores and compares the effectiveness of syntactic-semantic linguistic structures for plagiarism detection using natural language processing techniques. It uses linguistic features, viz., part-of-speech tags, chunks and semantic roles, to detect plagiarized fragments, and applies a combined syntactic-semantic similarity metric that extracts semantic concepts from the WordNet lexical database. The linguistic information is used for effective pre-processing and to enable semantically relevant comparisons. Another major contribution is the analysis of the proposed approach on plagiarism cases of various complexity levels. The impact of plagiarism types and complexity levels on the extracted features is analyzed and discussed. Further, unlike existing systems, which were evaluated on limited data sets, the proposed approach is evaluated on a larger scale using the plagiarism corpus provided by the PAN competition (http://pan.webis.de) from 2009 to 2014. The approach showed considerable improvement over the top-ranked systems of the respective years. The evaluation and analysis across various cases of plagiarism also indicated the advantage of deeper linguistic features for identifying manually plagiarized data.
doi_str_mv 10.1016/j.ipm.2018.01.008
format Article
publisher Oxford: Elsevier Ltd
rights 2018 Elsevier Ltd
Copyright Pergamon Press Inc. May 2018
creationdate 2018-05-01
tpages 25
toplevel peer_reviewed
fulltext fulltext
identifier ISSN: 0306-4573
ispartof Information processing & management, 2018-05, Vol.54 (3), p.408-432
issn 0306-4573
1873-5371
language eng
recordid cdi_proquest_journals_2086367368
source Elsevier ScienceDirect Journals Complete
subjects Chunking
Complexity
Datasets
Feature extraction
Linguistics
Natural language processing
Plagiarism
Plagiarism detection
POS tagging
Semantic role labelling
Semantics
Syntactic-semantic
Systems analysis
Technological change
title Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T01%3A35%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unmasking%20text%20plagiarism%20using%20syntactic-semantic%20based%20natural%20language%20processing%20techniques:%20Comparisons,%20analysis%20and%20challenges&rft.jtitle=Information%20processing%20&%20management&rft.au=K.,%20Vani&rft.date=2018-05-01&rft.volume=54&rft.issue=3&rft.spage=408&rft.epage=432&rft.pages=408-432&rft.issn=0306-4573&rft.eissn=1873-5371&rft_id=info:doi/10.1016/j.ipm.2018.01.008&rft_dat=%3Cproquest_cross%3E2086367368%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2086367368&rft_id=info:pmid/&rft_els_id=S0306457317300547&rfr_iscdi=true