Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

Bibliographic details
Published in:	Journal of the American Society for Information Science and Technology 2006-05, Vol.57 (7), p.921-932
Main authors:	Conrad, Jack G., Schriber, Cindy P.
Format:	Article
Language:	English
Subjects:	Acquisition and access: development policy, licenses, censorship; Collection; Collection management; Computerized bibliographic records; Detection; Duplicates; Exact sciences and technology; Identification; Identification methods; Information and communication sciences; Information management; Information retrieval; Information science. Documentation; Information service management; Library and documentation centre management; Online information retrieval; Reproduction (copying); Sciences and techniques of general use; Searches; Searching; Studies
Online access:	Full text

Description
Abstract:	As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its deleterious effect on search results. Harnessing the expertise of both client‐users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. We subsequently examine a flexible method of characterizing and comparing documents to permit the identification of near duplicates. This method has produced promising results following an extensive evaluation using a production‐based test collection created by domain experts.
ISSN:	1532-2882 2330-1635 1532-2890 2330-1643
DOI:	10.1002/asi.20363