Cloud based industrial file handling and duplication removal using source based deduplication technique


Bibliographic Details
Main Authors: Majed, Samer O., Thamer, Sawsan K.
Format: Conference Proceeding
Language: English
Online Access: Full text
Description
Summary: Data deduplication is the process of eliminating redundant copies of data; each chunk of data is represented by a unique value, which is referenced by the original file containing that chunk. Data deduplication techniques are applied mainly in cloud-based systems to reduce storage space and improve connection bandwidth. In this paper, we introduce a data deduplication optimization technique for the data storage of cloud-based systems. The proposed technique optimizes deduplication by combining source-based and in-line methods: the source-based method is applied at the source that holds the data, while the in-line method is applied in RAM, where the data resides momentarily before the I/O write. Moreover, the proposed technique uses a content-based chunking algorithm with variable chunk sizes via the Rabin-Karp rolling hash (RKRH), a data chunking algorithm that breaks data files into segments of different sizes. In general, the process computes the hash value of each data chunk, which serves as its fingerprint. A chunk-availability check is then applied to determine whether the chunk already exists in storage; if it does not, a reference to the chunk is added and its hash value is stored as a key. The proposed technique also compresses each data chunk to reduce redundancy within the chunk. In practice, the proposed technique achieves a data deduplication ratio of 33 percent and an average upload latency of five seconds. Finally, the proposed approach works with any data file type, treated as a byte stream.
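The pipeline the abstract describes (content-defined variable-size chunking with a Rabin-Karp-style rolling hash, a fingerprint-based chunk-availability check, and per-chunk compression) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the hash base and modulus, window size, boundary mask, and chunk-size bounds are assumed values chosen for demonstration.

```python
import hashlib
import zlib

# Illustrative parameters (assumptions, not the paper's values).
BASE = 257                # rolling-hash base
MOD = (1 << 31) - 1       # rolling-hash modulus
WINDOW = 48               # sliding-window length in bytes
BOUNDARY_MASK = 0x1FFF    # boundary when low 13 hash bits are all 1s (~8 KiB avg)
MIN_CHUNK = 2 * 1024      # minimum chunk size
MAX_CHUNK = 64 * 1024     # forced cut to bound chunk size

def chunk_stream(data: bytes):
    """Split data into variable-size chunks at content-defined boundaries."""
    pow_base = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: remove the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * pow_base) % MOD
        h = (h * BASE + b) % MOD
        length = i - start + 1
        at_boundary = (h & BOUNDARY_MASK) == BOUNDARY_MASK
        if (length >= MIN_CHUNK and at_boundary) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0  # restart the rolling hash for the next chunk
    if start < len(data):
        chunks.append(data[start:])
    return chunks

class DedupStore:
    """Fingerprint-keyed chunk store with per-chunk compression."""

    def __init__(self):
        self.store = {}  # fingerprint (hex digest) -> compressed chunk bytes

    def put_file(self, data: bytes):
        """Store a file; return its recipe (ordered list of fingerprints)."""
        recipe = []
        for chunk in chunk_stream(data):
            fp = hashlib.sha256(chunk).hexdigest()  # chunk fingerprint
            if fp not in self.store:                # chunk-availability check
                self.store[fp] = zlib.compress(chunk)
            recipe.append(fp)
        return recipe

    def get_file(self, recipe):
        """Reassemble a file from its recipe."""
        return b"".join(zlib.decompress(self.store[fp]) for fp in recipe)
```

Because boundaries depend on content rather than fixed offsets, identical regions in two files tend to produce identical chunks even when data before them shifts, so the availability check deduplicates them; storing the same file twice adds no new chunks.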
ISSN:0094-243X
1551-7616
DOI:10.1063/5.0030989