Content-level deduplication on mobile internet datasets

Various systems and applications involve a large volume of duplicate items. Based on high data redundancy in real world datasets, data deduplication can reduce storage capacity and improve the utilization of network bandwidth. However, chunks of existing deduplications range in size from 4KB to over...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Hou, Ziyu, Chen, Xunxun, Wang, Yang
Format:	Tagungsbericht
Sprache:	eng
Schlagworte:	Datasets Internet Redundancy Relational data bases Storage capacity
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Various systems and applications involve a large volume of duplicate items. Based on high data redundancy in real world datasets, data deduplication can reduce storage capacity and improve the utilization of network bandwidth. However, chunks of existing deduplications range in size from 4KB to over 16KB, existing systems are not applicable to the datasets consisting of short records. In this paper, we propose a new framework called SF-Dedup which is able to implement the deduplication process on a large set of Mobile Internet records, the size of records can be smaller than 100B, or even smaller than 10B. SF-Dedup is a short fingerprint, in-line, hash-collisions-resolved deduplication. Results of experimental applications illustrate that SH-Dedup is able to reduce storage capacity and shorten query time on relational database.
ISSN:	0094-243X 1551-7616
DOI:	10.1063/1.4982026