Hash-based duplicate data element systems and methods

Bibliographic details
Main authors: Jameson, Katherine; Joshi, Neha; Augustine, Casey Andrew; Haddad, Linda; Alleman, Lauren K
Format: Patent
Language: English
Description
Summary: A method for reducing the storage of duplicate documents is provided. Methods may include hashing each document stored in a centralized data repository by executing a hashing algorithm on the document, outputting a hash-value, and adding the hash-value and a hash pointer to a hash table. Methods may further include crawling the hash table to identify duplicate hash-values. For each hash-value recorded on the hash table two or more times, methods may include combining the two or more duplicate hash-values into a cluster and, for each cluster, identifying on the hash table a unique hash-value. For the unique hash-value, methods may include maintaining the unique hash-value on the hash table and maintaining the document corresponding to the unique hash-value at its memory address. For each remaining duplicate hash-value in the cluster, methods may include deleting the corresponding document from its memory address and storing a reference pointer at that memory address.
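
The steps in the summary can be illustrated with a minimal Python sketch. It is not the patented implementation. Several things here are assumptions made only for illustration: the repository is modeled as a dictionary from memory address to document bytes, SHA-256 stands in for the unspecified hashing algorithm, the first address in each cluster is treated as the one holding the unique hash-value, and the names `Ref`, `build_hash_table`, `deduplicate`, and `read` are hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical reference pointer left in place of a deleted duplicate."""
    target: str  # address of the retained copy


def build_hash_table(repo):
    # One (hash-value, hash pointer) entry per stored document.
    # SHA-256 is an assumption; the record does not name an algorithm.
    return [
        (hashlib.sha256(doc).hexdigest(), addr)
        for addr, doc in repo.items()
        if isinstance(doc, bytes)
    ]


def deduplicate(repo):
    """Replace duplicate documents in repo with Ref pointers, in place.

    Returns the reduced hash table: one entry per distinct hash-value.
    """
    table = build_hash_table(repo)

    # "Crawl" the table and group addresses that share a hash-value.
    clusters = defaultdict(list)
    for digest, addr in table:
        clusters[digest].append(addr)

    reduced = []
    for digest, addrs in clusters.items():
        # Assumption: the first address seen is the copy that is kept.
        keeper, *duplicates = addrs
        reduced.append((digest, keeper))
        for addr in duplicates:
            # Drop the duplicate document and store a pointer instead.
            repo[addr] = Ref(target=keeper)
    return reduced


def read(repo, addr):
    # Follow a reference pointer, if present, to the retained document.
    entry = repo[addr]
    return repo[entry.target] if isinstance(entry, Ref) else entry


repo = {"0x01": b"report", "0x02": b"memo", "0x03": b"report"}
hash_table = deduplicate(repo)
```

In this sketch, matching hash-values are taken to mean identical documents. A production system might also compare the document bytes before deleting anything, which is a design choice the summary does not address.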