DCDedupe: Selective Deduplication and Delta Compression with Effective Routing for Distributed Storage
Saved in:
Published in: Journal of Grid Computing, 2018-06, Vol. 16 (2), p. 195-209
Main authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Data deduplication has become an essential part of storage systems for big data. Traditional compare-by-hash (CBH) deduplication does not fully address the challenge posed by similar files with small changes; delta compression can complement it to further improve storage efficiency. In this paper, we designed and implemented a distributed storage system called DCDedupe that efficiently and intelligently applies delta compression or deduplication based on the characteristics of the data. Unlike prior studies, this system works well even when data locality is weak or barely exists. In DCDedupe, we propose a pre-processing step that identifies content similarity and classifies data chunks into different categories. An appropriate routing algorithm then sends the data chunks to the right target storage nodes in the distributed system to boost storage efficiency. Our evaluation shows that the storage space saved by DCDedupe generally outweighs the performance penalties; in some use cases, DCDedupe makes it worthwhile to trade some throughput for lower storage costs. The overheads to input/output (IO) operations and memory usage have also been studied, and design recommendations are given.
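The abstract above outlines the core idea: classify chunks by content similarity and route them so that a single storage node can either deduplicate exact copies or delta-compress near-duplicates. The sketch below is a hypothetical Python illustration of that routing idea, not the paper's actual algorithm; the window size, node count, max-feature signature, and all function names are assumptions made for this example.

```python
# Hypothetical sketch of similarity-aware chunk routing (illustrative only).
# Identical chunks always map to the same node (enabling compare-by-hash
# dedup there); near-identical chunks very likely do too, so that node can
# delta-compress them against a stored base chunk.
import hashlib

NUM_NODES = 4   # assumed cluster size
WINDOW = 48     # assumed sliding-window width for similarity features


def _h64(data: bytes) -> int:
    """Deterministic 64-bit hash, stable across processes and nodes."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def fingerprint(chunk: bytes) -> str:
    """Strong hash used for exact (compare-by-hash) deduplication."""
    return hashlib.sha256(chunk).hexdigest()


def similarity_feature(chunk: bytes) -> int:
    """Max-feature signature: the largest hash over all sliding windows.
    A small local edit only touches a few windows, so near-identical
    chunks usually share the same feature value."""
    if len(chunk) <= WINDOW:
        return _h64(chunk)
    return max(_h64(chunk[i:i + WINDOW]) for i in range(len(chunk) - WINDOW + 1))


def route(chunk: bytes) -> int:
    """Pick the target storage node from the similarity feature."""
    return similarity_feature(chunk) % NUM_NODES


if __name__ == "__main__":
    base = b" ".join(b"record-%d" % i for i in range(400))
    edit = base.replace(b"record-123", b"record-123-changed", 1)  # one small edit
    print("nodes:", route(base), route(edit))                     # very likely equal
    print("exact duplicate?", fingerprint(base) == fingerprint(edit))  # False: CBH alone misses it
```

The point of routing by a content-derived feature rather than, say, round-robin is that the node receiving a chunk is also likely to hold its near-duplicates, which is the precondition for performing delta compression locally.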
ISSN: 1570-7873, 1572-9184
DOI: 10.1007/s10723-018-9429-3