A machine learning approach to query generation in plagiarism source retrieval

Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in pla...


Bibliographic details
Published in: Frontiers of information technology & electronic engineering 2017-10, Vol.18 (10), p.1556-1572
Main authors: Kong, Lei-lei, Lu, Zhi-mao, Qi, Hao-liang, Han, Zhong-yuan
Format: Article
Language: eng
Online access: Full text
container_end_page 1572
container_issue 10
container_start_page 1556
container_title Frontiers of information technology & electronic engineering
container_volume 18
creator Kong, Lei-lei
Lu, Zhi-mao
Qi, Hao-liang
Han, Zhong-yuan
description Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in the current research. Each heuristic-based method has its own advantages, and no one statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements on heuristic methods for source retrieval rely mainly on the experience of experts. This leads to difficulties in putting forward new heuristic methods that can overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach to select the best queries from the candidates. The statistical machine learning approach to query generation for source retrieval is formulated as a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first research to apply machine learning methods to resolve the problem of query generation for source retrieval. To solve the essential problem of an absence of training data for learning to rank, the building of training samples for source retrieval is also conducted. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. With respect to the established baselines, the experimental results show that applying our proposed query generation method based on machine learning yields statistically significant improvements over baselines in source retrieval effectiveness.
doi_str_mv 10.1631/FITEE.1601344
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2918724438</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><cqvip_id>74908583504849554948484948</cqvip_id><sourcerecordid>2918724438</sourcerecordid><originalsourceid>FETCH-LOGICAL-c348t-9e4990d951ce9cf403d1799a5c5b49d9893dcbbb0409690a70656d123884f9753</originalsourceid><addsrcrecordid>eNp1kMFPwjAUxhujiQQ5em_iediu7dZ3JASUhOgFz0vXdaNkdKMdJvz3FkE9eel7efm99339EHqkZEozRp-Xq81iEVtCGec3aJQSEAmkjNz-9FTyezQJYUcIoRmFHOQIvc3wXumtdQa3RnlnXYNV3_suDvHQ4cPR-BNujDNeDbZz2Drct6qxytuwx6E7em2wN4O35lO1D-iuVm0wk2sdo4_lYjN_TdbvL6v5bJ1oxuWQgOEApAJBtQFdc8IqmgMooUXJoQIJrNJlWRJOIAOicpKJrKIpk5LXkAs2Rk-Xu9FptBiGYheduChZpPGjeco5k5FKLpT2XQje1EXv7V75U0FJcU6t-E6tuKYW-emFD5FzjfF_V_9bYFeBbeeaQ9z5Vcg5ECkkE4RLDkJwiFWeX_YFKJF8RA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2918724438</pqid></control><display><type>article</type><title>A machine learning approach to query generation in plagiarism source retrieval</title><source>Springer Nature - Complete Springer Journals</source><source>ProQuest Central UK/Ireland</source><source>Alma/SFX Local Collection</source><source>ProQuest Central</source><creator>Kong, Lei-lei ; Lu, Zhi-mao ; Qi, Hao-liang ; Han, Zhong-yuan</creator><creatorcontrib>Kong, Lei-lei ; Lu, Zhi-mao ; Qi, Hao-liang ; Han, Zhong-yuan</creatorcontrib><description>Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in the current research. 
Each heuristic-based method has its own advantages, and no one statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements on heuristic methods for source retrieval rely mainly on the experience of experts. This leads to difficulties in putting forward new heuristic methods that can overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach to select the best queries from the candidates. The statistical machine learning approach to query generation for source retrieval is formulated as a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first research to apply machine learning methods to resolve the problem of query generation for source retrieval. To solve the essential problem of an absence of training data for learning to rank, the building of training samples for source retrieval is also conducted. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. 
With respect to the established baselines, the experimental results show that applying our proposed query generation method based on machine learning yields statistically significant improvements over baselines in source retrieval effectiveness.</description><identifier>ISSN: 2095-9184</identifier><identifier>EISSN: 2095-9230</identifier><identifier>DOI: 10.1631/FITEE.1601344</identifier><language>eng</language><publisher>Hangzhou: Zhejiang University Press</publisher><subject>Communications Engineering ; Computer Hardware ; Computer Science ; Computer Systems Organization and Communication Networks ; Documents ; Electrical Engineering ; Electronics and Microelectronics ; Heuristic ; Heuristic methods ; Instrumentation ; Machine learning ; Methods ; Networks ; Plagiarism ; Queries ; Retrieval ; Segments</subject><ispartof>Frontiers of information technology &amp; electronic engineering, 2017-10, Vol.18 (10), p.1556-1572</ispartof><rights>Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2017</rights><rights>Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 
2017.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c348t-9e4990d951ce9cf403d1799a5c5b49d9893dcbbb0409690a70656d123884f9753</citedby><cites>FETCH-LOGICAL-c348t-9e4990d951ce9cf403d1799a5c5b49d9893dcbbb0409690a70656d123884f9753</cites><orcidid>0000-0002-4636-3507</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Uhttp://image.cqvip.com/vip1000/qk/89589A/89589A.jpg</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1631/FITEE.1601344$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2918724438?pq-origsite=primo$$EHTML$$P50$$Gproquest$$H</linktohtml><link.rule.ids>314,776,780,21368,27903,27904,33723,41467,42536,43784,51297,64361,64365,72215</link.rule.ids></links><search><creatorcontrib>Kong, Lei-lei</creatorcontrib><creatorcontrib>Lu, Zhi-mao</creatorcontrib><creatorcontrib>Qi, Hao-liang</creatorcontrib><creatorcontrib>Han, Zhong-yuan</creatorcontrib><title>A machine learning approach to query generation in plagiarism source retrieval</title><title>Frontiers of information technology &amp; electronic engineering</title><addtitle>Frontiers Inf Technol Electronic Eng</addtitle><addtitle>Frontiers of Information Technology &amp; Electronic Engineering</addtitle><description>Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in the current research. Each heuristic-based method has its own advantages, and no one statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. 
Further improvements on heuristic methods for source retrieval rely mainly on the experience of experts. This leads to difficulties in putting forward new heuristic methods that can overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach to select the best queries from the candidates. The statistical machine learning approach to query generation for source retrieval is formulated as a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first research to apply machine learning methods to resolve the problem of query generation for source retrieval. To solve the essential problem of an absence of training data for learning to rank, the building of training samples for source retrieval is also conducted. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. 
With respect to the established baselines, the experimental results show that applying our proposed query generation method based on machine learning yields statistically significant improvements over baselines in source retrieval effectiveness.</description><subject>Communications Engineering</subject><subject>Computer Hardware</subject><subject>Computer Science</subject><subject>Computer Systems Organization and Communication Networks</subject><subject>Documents</subject><subject>Electrical Engineering</subject><subject>Electronics and Microelectronics</subject><subject>Heuristic</subject><subject>Heuristic methods</subject><subject>Instrumentation</subject><subject>Machine learning</subject><subject>Methods</subject><subject>Networks</subject><subject>Plagiarism</subject><subject>Queries</subject><subject>Retrieval</subject><subject>Segments</subject><issn>2095-9184</issn><issn>2095-9230</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp1kMFPwjAUxhujiQQ5em_iediu7dZ3JASUhOgFz0vXdaNkdKMdJvz3FkE9eel7efm99339EHqkZEozRp-Xq81iEVtCGec3aJQSEAmkjNz-9FTyezQJYUcIoRmFHOQIvc3wXumtdQa3RnlnXYNV3_suDvHQ4cPR-BNujDNeDbZz2Drct6qxytuwx6E7em2wN4O35lO1D-iuVm0wk2sdo4_lYjN_TdbvL6v5bJ1oxuWQgOEApAJBtQFdc8IqmgMooUXJoQIJrNJlWRJOIAOicpKJrKIpk5LXkAs2Rk-Xu9FptBiGYheduChZpPGjeco5k5FKLpT2XQje1EXv7V75U0FJcU6t-E6tuKYW-emFD5FzjfF_V_9bYFeBbeeaQ9z5Vcg5ECkkE4RLDkJwiFWeX_YFKJF8RA</recordid><startdate>20171001</startdate><enddate>20171001</enddate><creator>Kong, Lei-lei</creator><creator>Lu, Zhi-mao</creator><creator>Qi, Hao-liang</creator><creator>Han, Zhong-yuan</creator><general>Zhejiang University Press</general><general>Springer Nature 
B.V</general><scope>2RA</scope><scope>92L</scope><scope>CQIGP</scope><scope>~WA</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>L6V</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PTHSS</scope><orcidid>https://orcid.org/0000-0002-4636-3507</orcidid></search><sort><creationdate>20171001</creationdate><title>A machine learning approach to query generation in plagiarism source retrieval</title><author>Kong, Lei-lei ; Lu, Zhi-mao ; Qi, Hao-liang ; Han, Zhong-yuan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c348t-9e4990d951ce9cf403d1799a5c5b49d9893dcbbb0409690a70656d123884f9753</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>Communications Engineering</topic><topic>Computer Hardware</topic><topic>Computer Science</topic><topic>Computer Systems Organization and Communication Networks</topic><topic>Documents</topic><topic>Electrical Engineering</topic><topic>Electronics and Microelectronics</topic><topic>Heuristic</topic><topic>Heuristic methods</topic><topic>Instrumentation</topic><topic>Machine learning</topic><topic>Methods</topic><topic>Networks</topic><topic>Plagiarism</topic><topic>Queries</topic><topic>Retrieval</topic><topic>Segments</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kong, Lei-lei</creatorcontrib><creatorcontrib>Lu, Zhi-mao</creatorcontrib><creatorcontrib>Qi, Hao-liang</creatorcontrib><creatorcontrib>Han, 
Zhong-yuan</creatorcontrib><collection>中文科技期刊数据库</collection><collection>中文科技期刊数据库-CALIS站点</collection><collection>中文科技期刊数据库-7.0平台</collection><collection>中文科技期刊数据库- 镜像站点</collection><collection>CrossRef</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>Engineering Collection</collection><jtitle>Frontiers of information technology &amp; electronic engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kong, Lei-lei</au><au>Lu, Zhi-mao</au><au>Qi, Hao-liang</au><au>Han, Zhong-yuan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A machine learning approach to query generation in plagiarism source retrieval</atitle><jtitle>Frontiers of information technology &amp; electronic engineering</jtitle><stitle>Frontiers 
Inf Technol Electronic Eng</stitle><addtitle>Frontiers of Information Technology &amp; Electronic Engineering</addtitle><date>2017-10-01</date><risdate>2017</risdate><volume>18</volume><issue>10</issue><spage>1556</spage><epage>1572</epage><pages>1556-1572</pages><issn>2095-9184</issn><eissn>2095-9230</eissn><abstract>Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in the current research. Each heuristic-based method has its own advantages, and no one statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements on heuristic methods for source retrieval rely mainly on the experience of experts. This leads to difficulties in putting forward new heuristic methods that can overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach to select the best queries from the candidates. The statistical machine learning approach to query generation for source retrieval is formulated as a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first research to apply machine learning methods to resolve the problem of query generation for source retrieval. To solve the essential problem of an absence of training data for learning to rank, the building of training samples for source retrieval is also conducted. 
We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. With respect to the established baselines, the experimental results show that applying our proposed query generation method based on machine learning yields statistically significant improvements over baselines in source retrieval effectiveness.</abstract><cop>Hangzhou</cop><pub>Zhejiang University Press</pub><doi>10.1631/FITEE.1601344</doi><tpages>17</tpages><orcidid>https://orcid.org/0000-0002-4636-3507</orcidid></addata></record>
fulltext fulltext
identifier ISSN: 2095-9184
ispartof Frontiers of information technology & electronic engineering, 2017-10, Vol.18 (10), p.1556-1572
issn 2095-9184
2095-9230
language eng
recordid cdi_proquest_journals_2918724438
source Springer Nature - Complete Springer Journals; ProQuest Central UK/Ireland; Alma/SFX Local Collection; ProQuest Central
subjects Communications Engineering
Computer Hardware
Computer Science
Computer Systems Organization and Communication Networks
Documents
Electrical Engineering
Electronics and Microelectronics
Heuristic
Heuristic methods
Instrumentation
Machine learning
Methods
Networks
Plagiarism
Queries
Retrieval
Segments
title A machine learning approach to query generation in plagiarism source retrieval
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T15%3A58%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20machine%20learning%20approach%20to%20query%20generation%20in%20plagiarism%20source%20retrieval&rft.jtitle=Frontiers%20of%20information%20technology%20&%20electronic%20engineering&rft.au=Kong,%20Lei-lei&rft.date=2017-10-01&rft.volume=18&rft.issue=10&rft.spage=1556&rft.epage=1572&rft.pages=1556-1572&rft.issn=2095-9184&rft.eissn=2095-9230&rft_id=info:doi/10.1631/FITEE.1601344&rft_dat=%3Cproquest_cross%3E2918724438%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2918724438&rft_id=info:pmid/&rft_cqvip_id=74908583504849554948484948&rfr_iscdi=true