Research and application of the detection on duplicate web pages on campus search engine
At present, for commercial reasons, general-purpose search engines cannot meet the requirements of a campus network for timeliness, integrity of the collected information, and ranking of results. Universities therefore need to build their own campus network search engines. In addition, information resources are...
Main authors: | Yongbing Gao, Fang Zhang, Bin Hao, Wei Gong |
---|---|
Format: | Conference proceedings |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 558 |
---|---|
container_issue | |
container_start_page | 555 |
container_title | |
container_volume | |
creator | Yongbing Gao ; Fang Zhang ; Bin Hao ; Wei Gong |
description | At present, for commercial reasons, general-purpose search engines cannot meet the requirements of a campus network for timeliness, integrity of the collected information, and ranking of results. Universities therefore need to build their own campus network search engines. In addition, information resources are reprinted across department websites, so users often find duplicate pages with similar content in their search results. The paper analyzes the need for a campus search engine and proposes a duplicate-detection algorithm, based on the longest paragraph and a fingerprint, which is tested in Nutch. Analysis and experiments show that the algorithm efficiently reduces duplicate documents, with better resistance to noise, lower time and space complexity, and higher recall and accuracy. |
doi_str_mv | 10.1109/ICSESS.2012.6269527 |
format | Conference Proceeding |
isbn | 9781467320078 ; 1467320072 |
eisbn | 1467320064 ; 1467320080 ; 9781467320085 ; 9781467320061 |
pub | IEEE |
date | 2012-06 |
tpages | 4 |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2327-0586 |
ispartof | 2012 IEEE International Conference on Computer Science and Automation Engineering, 2012, p.555-558 |
issn | 2327-0586 |
language | eng |
recordid | cdi_ieee_primary_6269527 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | Accuracy ; Campus Search Engine ; Duplicate Detection ; Educational institutions ; Fingerprint recognition ; MD5 ; Nutch ; Paragraph Fingerprint |
title | Research and application of the detection on duplicate web pages on campus search engine |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T19%3A25%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Research%20and%20application%20of%20the%20detection%20on%20duplicate%20web%20pages%20on%20campus%20search%20engine&rft.btitle=2012%20IEEE%20International%20Conference%20on%20Computer%20Science%20and%20Automation%20Engineering&rft.au=Yongbing%20Gao&rft.date=2012-06&rft.spage=555&rft.epage=558&rft.pages=555-558&rft.issn=2327-0586&rft.isbn=9781467320078&rft.isbn_list=1467320072&rft_id=info:doi/10.1109/ICSESS.2012.6269527&rft_dat=%3Cieee_6IE%3E6269527%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=1467320064&rft.eisbn_list=1467320080&rft.eisbn_list=9781467320085&rft.eisbn_list=9781467320061&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6269527&rfr_iscdi=true |
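
The description above mentions duplicate detection based on the longest paragraph and a fingerprint, and the subject terms list MD5. The record gives no further detail, so the following is only a minimal sketch of one way such a check could look, not the authors' algorithm and not Nutch code. The function names, the blank-line paragraph splitting, and the keep-the-first-page policy are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import re


def longest_paragraph(text: str) -> str:
    """Return the longest paragraph of a page, with whitespace collapsed."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return max(paragraphs, key=len, default="")


def fingerprint(text: str) -> str:
    """MD5 hex digest of the longest paragraph (hypothetical fingerprint)."""
    return hashlib.md5(longest_paragraph(text).encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> dict[str, str]:
    """Keep the first page seen for each fingerprint and drop the rest."""
    seen: dict[str, str] = {}  # fingerprint -> first URL with that fingerprint
    for url, text in pages.items():
        seen.setdefault(fingerprint(text), url)
    return {url: pages[url] for url in seen.values()}


# Hypothetical pages: "a" and "b" reprint the same notice with different
# navigation and footer text; "c" carries a different notice.
pages = {
    "a": "Home | News\n\nThe library will extend its opening hours during the exam period.\n\nCopyright A",
    "b": "Dept of CS\n\nThe library will extend   its opening hours during the exam period.\n\nContact us",
    "c": "Dept of CS\n\nA completely different announcement about the sports day schedule.\n\nContact us",
}
unique_pages = deduplicate(pages)
```

Hashing only the longest paragraph means that short navigation or footer text around a reprinted notice does not change the fingerprint, which is one plausible reading of the noise resistance claimed in the description. An exact hash like this would still miss reprints whose main paragraph has been edited.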