Research and application of the detection on duplicate web pages on campus search engine
Main Authors: | , , , |
---|---|
Format: | Conference Proceeding |
Language: | eng |
Summary: | General-purpose search engines, driven by commercial priorities, cannot meet the needs of a campus network in terms of timeliness, completeness of the collected information, and result ranking, so universities need to build their own campus search engines. In addition, information is frequently reprinted across department websites, so users often see duplicate pages with similar content in the search results. This paper analyzes the need for a campus search engine and proposes a duplicate-detection algorithm, based on the longest paragraph and a fingerprint, that is tested in Nutch. Analysis and experiments show that the algorithm efficiently reduces duplicate documents, with better noise resistance, lower time and space complexity, and higher recall and accuracy. |
ISSN: | 2327-0586 |
DOI: | 10.1109/ICSESS.2012.6269527 |
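
The summary names the idea (fingerprinting the longest paragraph of a page) but gives no details, so the following is only a minimal sketch of how such a scheme could look, not the paper's algorithm. The blank-line paragraph splitting, the whitespace and case normalization, the use of MD5, and the names `longest_paragraph`, `fingerprint`, and `deduplicate` are all assumptions made for illustration; the Nutch integration is not shown.

```python
import hashlib
import re
from typing import Dict


def longest_paragraph(text: str) -> str:
    """Return the longest blank-line-separated paragraph (assumed splitting rule)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return max(paragraphs, key=len, default="")


def fingerprint(text: str) -> str:
    """Hash the normalized longest paragraph; MD5 is an assumed choice of hash."""
    body = longest_paragraph(text)
    # Collapse whitespace and lowercase so trivial formatting noise is ignored.
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages: Dict[str, str]) -> Dict[str, str]:
    """Keep the first page seen for each fingerprint, in insertion order."""
    seen = set()
    kept = {}
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            kept[url] = text
    return kept


# Hypothetical pages: "a" and "b" reprint the same notice with different
# navigation and footer text; "c" is unrelated.
pages = {
    "a": "Home | News\n\nThe library will extend opening hours during the exam period in June.\n\nContact",
    "b": "Dept of CS\n\nThe library will extend   opening hours during the exam period in June.\n\nCopyright",
    "c": "Sports\n\nThe football team won the regional final on Saturday afternoon.",
}
unique_pages = deduplicate(pages)
```

Because only the longest paragraph is hashed, short surrounding blocks such as menus and footers do not affect the fingerprint, which is one plausible reading of the noise resistance the summary mentions. An exact hash still treats any edit inside that paragraph as a different page.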