Applying Delta Compression to Packed Datasets for Efficient Data Reduction

Bibliographic Details
Published in:	IEEE Transactions on Computers, 2024-01, Vol. 73 (1), pp. 73-85
Main authors:	Zhang, Yucheng; Jiang, Hong; Wang, Chunzhi; Huang, Wei; Chen, Meng; Zhang, Yongxuan; Zhang, Le
Format:	Article
Language:	English
Subjects:	Back up systems; backup system; chunk fragmentation; Compression ratio; Containers; Data reduction; Datasets; Delta compression; Encoding; Indexing; Metadata; Prefetching; Redundancy; Throughput

Description
Abstract:	Backup systems often adopt deduplication techniques for data reduction. Real-world backup products often group files into larger units (called packed files) before deduplicating them. The grouping entails inserting metadata immediately before the contents of each file in the packed file. Some metadata change with every backup, producing a substantial number of similar (non-duplicate) chunks. Delta compression can remove redundancy among those similar chunks but cannot be applied to HDD-based backup storage because the I/Os required for fetching base chunks cause severe throughput loss. For packed datasets, some duplicate chunks, called persistent fragmented chunks (PFCs), are rewritten in every backup. We observe that corresponding chunk pairs surrounding identical PFCs are non-identical due to different metadata but similar to each other. In this article, we propose PFC-delta to perform high-performance delta compression for the aforementioned similar chunks on top of deduplication. PFC-delta identifies and prefetches potential base chunks stored along with PFCs by piggybacking on the routine I/Os during deduplication, thus avoiding extra I/Os. We also propose a hash-less delta encoding approach to reduce extra computational overheads. Evaluation results with four real-world datasets show that PFC-delta improves both compression ratio and restore performance, while increasing the backup throughput on all but one of the datasets.
ISSN:	0018-9340; 1557-9956
DOI:	10.1109/TC.2023.3318404
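
Illustration (not part of the catalog record or the article): the abstract notes that chunks which differ only in their per-backup metadata are similar but not identical, so deduplication cannot remove them while delta compression can. The sketch below shows that general idea only. It is not PFC-delta and not the article's hash-less encoder; it uses Python's standard difflib as a stand-in matcher, and the names delta_encode, delta_decode, and the sample chunk contents are hypothetical.

```python
from difflib import SequenceMatcher


def delta_encode(base: bytes, target: bytes):
    """Describe target as COPY-from-base and INSERT-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Region shared with the base chunk: store only offset and length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Region unique to the target chunk: store the literal bytes.
            ops.append(("insert", target[j1:j2]))
        # "delete" regions exist only in the base, so nothing is emitted.
    return ops


def delta_decode(base: bytes, ops) -> bytes:
    """Rebuild the target chunk from the base chunk and the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


# Hypothetical chunks: same file contents, different per-backup metadata.
payload = b"file contents that did not change between backups " * 20
base_chunk = b"mtime=1700000000;" + payload
target_chunk = b"mtime=1700086400;" + payload

delta = delta_encode(base_chunk, target_chunk)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")

assert delta_decode(base_chunk, delta) == target_chunk
print(f"target: {len(target_chunk)} bytes, literals in delta: {literal_bytes} bytes")
```

Restoring the target requires reading the base chunk first, which is why the abstract treats fetching base chunks as the main cost of delta compression on HDD-based storage.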