Research and application of the detection on duplicate web pages on campus search engine

At present, general-purpose search engines, because of their commercial orientation, cannot meet a campus network's needs for timeliness, completeness of collected information, and result ranking. Universities therefore need to build their own campus network search engines. In addition, information resources are...

Bibliographic details
Main authors: Yongbing Gao, Fang Zhang, Bin Hao, Wei Gong
Format: Conference Proceeding
Language: eng
Subjects:
Online access: Order full text
container_end_page 558
container_issue
container_start_page 555
container_title
container_volume
creator Yongbing Gao
Fang Zhang
Bin Hao
Wei Gong
description At present, general-purpose search engines, because of their commercial orientation, cannot meet a campus network's needs for timeliness, completeness of collected information, and result ranking. Universities therefore need to build their own campus network search engines. In addition, information resources are often reprinted across department websites, so users frequently see duplicate pages with similar content in their search results. The paper analyzes the necessity of building a campus search engine and proposes a duplicate-detection algorithm, tested in Nutch, that is based on the longest paragraph and its fingerprint. Analysis and experiments show that the algorithm efficiently reduces duplicate documents and offers better noise resistance, lower time and space complexity, and higher recall and accuracy.
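The record gives only the outline of the method (longest paragraph plus fingerprint, with MD5 among the subject terms), so the following is a minimal sketch of that general idea, not the authors' implementation. It assumes that paragraphs are separated by blank lines, that the fingerprint is the MD5 digest of the whitespace-normalized longest paragraph, and that two pages count as duplicates when their fingerprints match exactly. The function names `longest_paragraph`, `fingerprint`, and `deduplicate` are hypothetical.

```python
import hashlib
import re


def longest_paragraph(text):
    """Return the longest paragraph of a page, with whitespace normalized.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return max(paragraphs, key=len, default="")


def fingerprint(text):
    """MD5 hex digest of the page's longest paragraph (assumed fingerprint)."""
    return hashlib.md5(longest_paragraph(text).encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Keep only the first URL seen for each fingerprint.

    `pages` maps a URL to the extracted plain text of that page.
    """
    first_url_by_fingerprint = {}
    for url, text in pages.items():
        first_url_by_fingerprint.setdefault(fingerprint(text), url)
    return {url: pages[url] for url in first_url_by_fingerprint.values()}


if __name__ == "__main__":
    body = "The library will close early on Friday for maintenance work."
    pages = {
        "dept-a/news": "Home | News\n\n" + body + "\n\nContact us",
        "dept-b/news": "Dept. B\n\n" + body.replace(" ", "  ") + "\n\nFooter",
        "dept-c/exams": "The examination timetable for the spring term is now online.",
    }
    print(sorted(deduplicate(pages)))
```

Because only the longest paragraph is hashed, short navigation bars and footers that differ between department sites do not change the fingerprint, which is one plausible reading of the noise resistance the abstract claims. An exact hash match will still miss pages whose main paragraph was edited even slightly.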
doi_str_mv 10.1109/ICSESS.2012.6269527
format Conference Proceeding
fullrecord <record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_6269527</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6269527</ieee_id><sourcerecordid>6269527</sourcerecordid><originalsourceid>FETCH-LOGICAL-i90t-36aef5a8e042dc9699d35d62c6b01ab8a56bc2f5fddf56d64a2de24139255dc53</originalsourceid><addsrcrecordid>eNo1kFFLwzAUhSMqOGd_wV7yB1qTmyZpHqVMHQwEuwffxm1yu0W2rqwd4r-X0fp0ON-B7-EwtpAik1K451VZLasqAyEhM2CcBnvDHmVurAIhTH7LEmeL_26LOzYDBTYVujAPLOn7byGuXIK2M_b1ST3h2e85toFj1x2ixyGeWn5q-LAnHmggP4KWh8u4E_-hmne4o_6KPR67S88nEbW72NITu2_w0FMy5ZxtXpeb8j1df7ytypd1Gp0YUmWQGo0FiRyCd8a5oHQw4E0tJNYFalN7aHQTQqNNMDlCIMilcqB18FrN2WLURiLadud4xPPvdvpF_QEa2FXi</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Research and application of the detection on duplicate web pages on campus search engine</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Yongbing Gao ; Fang Zhang ; Bin Hao ; Wei Gong</creator><creatorcontrib>Yongbing Gao ; Fang Zhang ; Bin Hao ; Wei Gong</creatorcontrib><description>At present, for some commercial purposes, general search engine can't satisfy timeliness, integrity of collected information and sort of results of campus network. It is necessary that various universities construct their own campus network search engine. In addition, information resources are reprinted between department websites, and users often get duplicate pages which have similar content in the search results pages. The necessity of constructing campus search engine is analyzed, and the detection algorithm which is tested in Nutch based on the longest paragraph and fingerprint is proposed. The analysis and experiments show that the algorithm efficiently reduces duplicate documents. 
It shows better ability of resistance noise, lower complexity of time and space, higher recall and accuracy.</description><identifier>ISSN: 2327-0586</identifier><identifier>ISBN: 9781467320078</identifier><identifier>ISBN: 1467320072</identifier><identifier>EISBN: 1467320064</identifier><identifier>EISBN: 1467320080</identifier><identifier>EISBN: 9781467320085</identifier><identifier>EISBN: 9781467320061</identifier><identifier>DOI: 10.1109/ICSESS.2012.6269527</identifier><language>eng</language><publisher>IEEE</publisher><subject>Accuracy ; Campus Search Engine ; Duplicate Detection ; Educational institutions ; Fingerprint recognition ; MD5 ; Nutch ; Paragraph Fingerprint</subject><ispartof>2012 IEEE International Conference on Computer Science and Automation Engineering, 2012, p.555-558</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6269527$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2052,27902,54895</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6269527$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Yongbing Gao</creatorcontrib><creatorcontrib>Fang Zhang</creatorcontrib><creatorcontrib>Bin Hao</creatorcontrib><creatorcontrib>Wei Gong</creatorcontrib><title>Research and application of the detection on duplicate web pages on campus search engine</title><title>2012 IEEE International Conference on Computer Science and Automation Engineering</title><addtitle>ICSESS</addtitle><description>At present, for some commercial purposes, general search engine can't satisfy timeliness, integrity of collected information and sort of results of campus network. It is necessary that various universities construct their own campus network search engine. 
In addition, information resources are reprinted between department websites, and users often get duplicate pages which have similar content in the search results pages. The necessity of constructing campus search engine is analyzed, and the detection algorithm which is tested in Nutch based on the longest paragraph and fingerprint is proposed. The analysis and experiments show that the algorithm efficiently reduces duplicate documents. It shows better ability of resistance noise, lower complexity of time and space, higher recall and accuracy.</description><subject>Accuracy</subject><subject>Campus Search Engine</subject><subject>Duplicate Detection</subject><subject>Educational institutions</subject><subject>Fingerprint recognition</subject><subject>MD5</subject><subject>Nutch</subject><subject>Paragraph Fingerprint</subject><issn>2327-0586</issn><isbn>9781467320078</isbn><isbn>1467320072</isbn><isbn>1467320064</isbn><isbn>1467320080</isbn><isbn>9781467320085</isbn><isbn>9781467320061</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2012</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNo1kFFLwzAUhSMqOGd_wV7yB1qTmyZpHqVMHQwEuwffxm1yu0W2rqwd4r-X0fp0ON-B7-EwtpAik1K451VZLasqAyEhM2CcBnvDHmVurAIhTH7LEmeL_26LOzYDBTYVujAPLOn7byGuXIK2M_b1ST3h2e85toFj1x2ixyGeWn5q-LAnHmggP4KWh8u4E_-hmne4o_6KPR67S88nEbW72NITu2_w0FMy5ZxtXpeb8j1df7ytypd1Gp0YUmWQGo0FiRyCd8a5oHQw4E0tJNYFalN7aHQTQqNNMDlCIMilcqB18FrN2WLURiLadud4xPPvdvpF_QEa2FXi</recordid><startdate>201206</startdate><enddate>201206</enddate><creator>Yongbing Gao</creator><creator>Fang Zhang</creator><creator>Bin Hao</creator><creator>Wei Gong</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>201206</creationdate><title>Research and application of the detection on duplicate web pages on campus search 
engine</title><author>Yongbing Gao ; Fang Zhang ; Bin Hao ; Wei Gong</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i90t-36aef5a8e042dc9699d35d62c6b01ab8a56bc2f5fddf56d64a2de24139255dc53</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2012</creationdate><topic>Accuracy</topic><topic>Campus Search Engine</topic><topic>Duplicate Detection</topic><topic>Educational institutions</topic><topic>Fingerprint recognition</topic><topic>MD5</topic><topic>Nutch</topic><topic>Paragraph Fingerprint</topic><toplevel>online_resources</toplevel><creatorcontrib>Yongbing Gao</creatorcontrib><creatorcontrib>Fang Zhang</creatorcontrib><creatorcontrib>Bin Hao</creatorcontrib><creatorcontrib>Wei Gong</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Yongbing Gao</au><au>Fang Zhang</au><au>Bin Hao</au><au>Wei Gong</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Research and application of the detection on duplicate web pages on campus search engine</atitle><btitle>2012 IEEE International Conference on Computer Science and Automation Engineering</btitle><stitle>ICSESS</stitle><date>2012-06</date><risdate>2012</risdate><spage>555</spage><epage>558</epage><pages>555-558</pages><issn>2327-0586</issn><isbn>9781467320078</isbn><isbn>1467320072</isbn><eisbn>1467320064</eisbn><eisbn>1467320080</eisbn><eisbn>9781467320085</eisbn><eisbn>9781467320061</eisbn><abstract>At present, for some 
commercial purposes, general search engine can't satisfy timeliness, integrity of collected information and sort of results of campus network. It is necessary that various universities construct their own campus network search engine. In addition, information resources are reprinted between department websites, and users often get duplicate pages which have similar content in the search results pages. The necessity of constructing campus search engine is analyzed, and the detection algorithm which is tested in Nutch based on the longest paragraph and fingerprint is proposed. The analysis and experiments show that the algorithm efficiently reduces duplicate documents. It shows better ability of resistance noise, lower complexity of time and space, higher recall and accuracy.</abstract><pub>IEEE</pub><doi>10.1109/ICSESS.2012.6269527</doi><tpages>4</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 2327-0586
ispartof 2012 IEEE International Conference on Computer Science and Automation Engineering, 2012, p.555-558
issn 2327-0586
language eng
recordid cdi_ieee_primary_6269527
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Accuracy
Campus Search Engine
Duplicate Detection
Educational institutions
Fingerprint recognition
MD5
Nutch
Paragraph Fingerprint
title Research and application of the detection on duplicate web pages on campus search engine
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T19%3A25%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Research%20and%20application%20of%20the%20detection%20on%20duplicate%20web%20pages%20on%20campus%20search%20engine&rft.btitle=2012%20IEEE%20International%20Conference%20on%20Computer%20Science%20and%20Automation%20Engineering&rft.au=Yongbing%20Gao&rft.date=2012-06&rft.spage=555&rft.epage=558&rft.pages=555-558&rft.issn=2327-0586&rft.isbn=9781467320078&rft.isbn_list=1467320072&rft_id=info:doi/10.1109/ICSESS.2012.6269527&rft_dat=%3Cieee_6IE%3E6269527%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=1467320064&rft.eisbn_list=1467320080&rft.eisbn_list=9781467320085&rft.eisbn_list=9781467320061&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6269527&rfr_iscdi=true