Research and application of the detection on duplicate web pages on campus search engine

Bibliographic Details
Main Authors:	Yongbing Gao, Fang Zhang, Bin Hao, Wei Gong
Format:	Conference Proceeding
Language:	English
Subjects:	Accuracy; Campus Search Engine; Duplicate Detection; Educational institutions; Fingerprint recognition; MD5; Nutch; Paragraph Fingerprint

Description
Summary:	At present, general-purpose search engines, which serve commercial purposes, cannot satisfy the campus network's requirements for timeliness, completeness of collected information, and ranking of results. Universities therefore need to construct their own campus network search engines. In addition, information resources are reposted across department websites, so users often find duplicate pages with similar content in the search results. The paper analyzes the necessity of constructing a campus search engine and proposes a duplicate detection algorithm, based on the longest paragraph and fingerprints, that is tested in Nutch. The analysis and experiments show that the algorithm efficiently reduces duplicate documents, with better noise resistance, lower time and space complexity, and higher recall and accuracy.
ISSN:	2327-0586
DOI:	10.1109/ICSESS.2012.6269527
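
The summary names the ingredients of the detection algorithm (longest paragraph, fingerprint, MD5) but not its details. The following Python sketch is only one plausible reading of that idea, not the authors' implementation: it assumes that paragraphs are separated by blank lines, that whitespace and letter case are normalized before hashing, and that two pages count as duplicates when the MD5 fingerprints of their longest paragraphs match. The function names `longest_paragraph`, `paragraph_fingerprint`, and `find_duplicates` are hypothetical, and the sketch does not use Nutch.

```python
import hashlib
import re


def longest_paragraph(text):
    """Return the longest blank-line-separated paragraph (assumed splitting rule)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return max(paragraphs, key=len) if paragraphs else ""


def paragraph_fingerprint(text):
    """MD5 hex digest of the normalized longest paragraph."""
    paragraph = longest_paragraph(text)
    # Assumed normalization: collapse whitespace runs and ignore letter case.
    normalized = re.sub(r"\s+", " ", paragraph).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Map of url -> text in, list of (duplicate_url, first_seen_url) out."""
    seen = {}
    duplicates = []
    for url, text in pages.items():
        fingerprint = paragraph_fingerprint(text)
        if fingerprint in seen:
            duplicates.append((url, seen[fingerprint]))
        else:
            seen[fingerprint] = url
    return duplicates


# Hypothetical pages: "a" and "b" carry the same notice with different
# navigation and footer text; "c" is unrelated.
pages = {
    "a": "Home | News\n\nThe university library will extend its opening "
         "hours during the exam period.\n\nCopyright 2012",
    "b": "Dept of CS\n\nThe university  library will extend its opening "
         "hours during the exam period.\n\nContact us",
    "c": "Other\n\nA completely different announcement about the sports "
         "day schedule.\n\nfooter",
}
duplicate_pairs = find_duplicates(pages)
```

Because only the longest paragraph is hashed, short navigation bars and footers that differ between department sites do not affect the fingerprint in this sketch, which is one way to read the noise resistance claimed in the summary. An exact hash still misses reposted pages whose main paragraph was edited.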