Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges
The proposed work aims to explore and compare the potency of syntactic-semantic based linguistic structures in plagiarism detection using natural language processing techniques. The current work explores linguistic features, viz., part of speech tags, chunks and semantic roles in detecting plagiariz...
Published in: | Information processing & management 2018-05, Vol.54 (3), p.408-432 |
---|---|
Main authors: | K., Vani ; Gupta, Deepa |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 432 |
---|---|
container_issue | 3 |
container_start_page | 408 |
container_title | Information processing & management |
container_volume | 54 |
creator | K., Vani ; Gupta, Deepa |
description | The proposed work aims to explore and compare the potency of syntactic-semantic based linguistic structures in plagiarism detection using natural language processing techniques. The current work explores linguistic features, viz., part of speech tags, chunks and semantic roles in detecting plagiarized fragments and utilizes a combined syntactic-semantic similarity metric, which extracts the semantic concepts from WordNet lexical database. The linguistic information is utilized for effective pre-processing and for availing semantically relevant comparisons. Another major contribution is the analysis of the proposed approach on plagiarism cases of various complexity levels. The impact of plagiarism types and complexity levels, upon the features extracted is analyzed and discussed. Further, unlike the existing systems, which were evaluated on some limited data sets, the proposed approach is evaluated on a larger scale using the plagiarism corpus provided by PAN (http://pan.webis.de) competition from 2009 to 2014. The approach presented considerable improvement in comparison with the top-ranked systems of the respective years. The evaluation and analysis with various cases of plagiarism also reflected the supremacy of deeper linguistic features for identifying manually plagiarized data. |
doi_str_mv | 10.1016/j.ipm.2018.01.008 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 0306-4573 |
ispartof | Information processing & management, 2018-05, Vol.54 (3), p.408-432 |
issn | 0306-4573 ; 1873-5371 |
language | eng |
recordid | cdi_proquest_journals_2086367368 |
source | Elsevier ScienceDirect Journals Complete |
subjects | Chunking ; Complexity ; Datasets ; Feature extraction ; Linguistics ; Natural language processing ; Plagiarism ; Plagiarism detection ; POS tagging ; Semantic role labelling ; Semantics ; Syntactic-semantic ; Systems analysis ; Technological change |
title | Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T01%3A35%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unmasking%20text%20plagiarism%20using%20syntactic-semantic%20based%20natural%20language%20processing%20techniques:%20Comparisons,%20analysis%20and%20challenges&rft.jtitle=Information%20processing%20&%20management&rft.au=K.,%20Vani&rft.date=2018-05-01&rft.volume=54&rft.issue=3&rft.spage=408&rft.epage=432&rft.pages=408-432&rft.issn=0306-4573&rft.eissn=1873-5371&rft_id=info:doi/10.1016/j.ipm.2018.01.008&rft_dat=%3Cproquest_cross%3E2086367368%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2086367368&rft_id=info:pmid/&rft_els_id=S0306457317300547&rfr_iscdi=true |