Scalable high performance de-duplication backup via hash join

Bibliographic Details
Published in:	Frontiers of information technology & electronic engineering 2010-05, Vol.11 (5), p.315-327
Main authors:	Yang, Tian-ming; Feng, Dan; Niu, Zhong-ying; Wan, Ya-ping
Format:	Article
Language:	English
Subjects:	Algorithms; Back up systems; Communications Engineering; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Electrical Engineering; Electronics and Microelectronics; Fingerprints; Instrumentation; Networks; Storage systems; Workload; Workloads
Online access:	Full text

Description
Abstract:	Apart from high space efficiency, enterprise de-duplication backup demands high performance, high scalability, and availability in large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead caused by constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches rely mainly on duplicate locality to avoid the disk bottleneck, and therefore degrade under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability for de-duplication. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously random and small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, thereby achieving high write throughput that is not influenced by workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm allows a cluster of servers to perform de-duplication backup in parallel; it is therefore conducive to distributed implementation and applicable to large-scale distributed storage systems.
ISSN:	1869-1951, 2095-9184, 1869-196X, 2095-9230
DOI:	10.1631/jzus.C0910445
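
The hash join idea summarized in the abstract can be illustrated with a minimal sketch. This is not Chunkfarm's implementation: the SHA-1 fingerprint, the partition count, the in-memory sets standing in for on-disk index files, and the names `fingerprint`, `partition_of`, and `dedup_batch` are all assumptions made for illustration. The sketch buckets a batch of already-received chunks by fingerprint partition, then joins each bucket against the matching index partition in one pass, which is the step that would replace many small random lookups with one sequential scan per partition.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; a real system would size partitions to fit memory


def fingerprint(chunk: bytes) -> str:
    """SHA-1 digest used as the chunk fingerprint (an assumed choice)."""
    return hashlib.sha1(chunk).hexdigest()


def partition_of(fp: str) -> int:
    """Map a fingerprint to a partition using its leading hex digits."""
    return int(fp[:8], 16) % NUM_PARTITIONS


def dedup_batch(chunks, index_partitions):
    """Hash-join a batch of chunks against a partitioned fingerprint index.

    index_partitions is a list of sets, one per partition. Each set stands in
    for an on-disk index file that would be read and rewritten sequentially.
    Returns the chunks not already stored, in first-seen order.
    """
    # Partition phase: bucket incoming fingerprints by partition.
    buckets = defaultdict(list)
    for pos, chunk in enumerate(chunks):
        fp = fingerprint(chunk)
        buckets[partition_of(fp)].append((pos, fp, chunk))

    # Probe phase: handle one partition at a time, so only that partition's
    # index needs to be resident while its bucket is probed and updated.
    unique = []
    for p in range(NUM_PARTITIONS):
        known = index_partitions[p]
        for pos, fp, chunk in buckets.get(p, []):
            if fp not in known:
                known.add(fp)  # index update happens in the same pass
                unique.append((pos, chunk))

    unique.sort(key=lambda item: item[0])
    return [chunk for _, chunk in unique]


index = [set() for _ in range(NUM_PARTITIONS)]
print(dedup_batch([b"alpha", b"beta", b"alpha"], index))  # [b'alpha', b'beta']
print(dedup_batch([b"beta", b"gamma"], index))            # [b'gamma']
```

Because each partition is probed independently, the partitions could in principle be assigned to different servers, which is one plausible reading of how decentralized lookup and update would enable parallel de-duplication.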