Applying Delta Compression to Packed Datasets for Efficient Data Reduction
Published in: | IEEE Transactions on Computers, 2024-01, Vol. 73 (1), pp. 73-85 |
---|---|
Main authors: | , , , , , , |
Format: | Article |
Language: | English |
Abstract: | Backup systems often adopt deduplication techniques for data reduction. Real-world backup products often group files into larger units (called packed files) before deduplicating them. The grouping entails inserting metadata immediately before the contents of each file in the packed file. Some metadata change with every backup, producing a substantial number of similar (non-duplicate) chunks. Delta compression can remove redundancy among those similar chunks, but it cannot be applied to HDD-based backup storage because the I/Os required for fetching base chunks cause severe throughput loss. For packed datasets, some duplicate chunks, called persistent fragmented chunks (PFCs), are rewritten in every backup. We observe that corresponding chunk pairs surrounding identical PFCs are non-identical due to different metadata but are similar to each other. In this article, we propose PFC-delta to perform high-performance delta compression for these similar chunks on top of deduplication. PFC-delta identifies and prefetches potential base chunks stored along with PFCs by piggybacking on the routine I/Os during deduplication, thus avoiding extra I/Os. We also propose a hash-less delta encoding approach to reduce extra computational overhead. Evaluation results with four real-world datasets show that PFC-delta improves both compression ratio and restore performance, while increasing the backup throughput on all but one dataset. |
ISSN: | 0018-9340, 1557-9956 |
DOI: | 10.1109/TC.2023.3318404 |
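
To make the abstract's central idea concrete, the following is a minimal sketch of generic delta compression between two similar chunks that differ only in a small metadata header. It is not PFC-delta itself: it uses Python's standard `difflib.SequenceMatcher` rather than the hash-less encoder the abstract mentions, and it does not model base-chunk prefetching or deduplication I/O. The function names, the `mtime=` header layout, and the per-instruction byte costs are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def delta_encode(base: bytes, target: bytes):
    """Encode target as copy/insert instructions relative to base (illustrative)."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Region shared with the base chunk: store only offset and length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Region unique to the target chunk: store the literal bytes.
            ops.append(("insert", target[j1:j2]))
        # "delete" regions exist only in the base, so nothing is emitted.
    return ops


def delta_decode(base: bytes, ops) -> bytes:
    """Rebuild the target chunk from the base chunk and the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def delta_size(ops) -> int:
    """Rough encoded size: assumed 8 bytes per copy, 4 bytes plus literal per insert."""
    return sum(8 if op[0] == "copy" else 4 + len(op[1]) for op in ops)


# Two hypothetical chunks: same file content, different per-backup metadata.
payload = bytes(range(256)) * 8
base = b"mtime=1700000000;" + payload
target = b"mtime=1700086400;" + payload

ops = delta_encode(base, target)
restored = delta_decode(base, ops)
```

Because the two chunks differ only in a few header bytes, the delta consists of a handful of short instructions, whereas deduplication alone would store the target chunk in full since its fingerprint differs from the base. Reading `base` back from disk is the cost that, according to the abstract, PFC-delta avoids by piggybacking on I/Os that deduplication already performs.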