Cloud based industrial file handling and duplication removal using source based deduplication technique


Bibliographic Details
Main Authors: Majed, Samer O., Thamer, Sawsan K.
Format: Conference Proceeding
Language: English
Online Access: Full text
Description
Summary: Data deduplication is the process of eliminating redundant copies of data; each chunk of data is represented by a unique value, which is referenced by the original file containing that chunk. Data deduplication techniques are applied mainly in cloud-based systems to reduce storage space and improve connection bandwidth. In this paper, we introduce a data deduplication optimization technique for the data storage of cloud-based systems. The proposed technique optimizes deduplication by combining source-based and in-line methods: the source-based method is applied at the source that holds the data, while the in-line method is applied in RAM, where the data resides momentarily before the I/O write. Moreover, the proposed technique uses a content-based chunking algorithm with variable chunk sizes via the Rabin-Karp rolling hash (RKRH), a data chunking algorithm that breaks data files into segments of different sizes. In general, the process computes the hash value of each data chunk, which serves as its fingerprint. A chunk-availability check is then applied to determine whether the chunk already exists in storage; if it does not, a reference to the chunk is added and its hash value is stored as a key. The proposed technique also compresses each data chunk to reduce redundancy within the chunk. In practice, the proposed technique achieves a data deduplication ratio of 33 percent and an average upload latency of five seconds. Finally, the proposed approach works with any data file type, treated as a byte stream.
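The pipeline the abstract describes (content-defined variable-size chunking with a Rabin-Karp-style rolling hash, a fingerprint-based chunk-availability check, and per-chunk compression) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the hash base and modulus, window size, boundary mask, and chunk-size bounds are assumed values chosen for demonstration.

```python
import hashlib
import zlib

# Illustrative parameters (assumptions, not the paper's values).
BASE = 257                # rolling-hash base
MOD = (1 << 31) - 1       # rolling-hash modulus
WINDOW = 48               # sliding-window length in bytes
BOUNDARY_MASK = 0x1FFF    # boundary when low 13 hash bits are all 1s (~8 KiB avg)
MIN_CHUNK = 2 * 1024      # minimum chunk size
MAX_CHUNK = 64 * 1024     # forced cut to bound chunk size

def chunk_stream(data: bytes):
    """Split data into variable-size chunks at content-defined boundaries."""
    pow_base = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: remove the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * pow_base) % MOD
        h = (h * BASE + b) % MOD
        length = i - start + 1
        at_boundary = (h & BOUNDARY_MASK) == BOUNDARY_MASK
        if (length >= MIN_CHUNK and at_boundary) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0  # restart the rolling hash for the next chunk
    if start < len(data):
        chunks.append(data[start:])
    return chunks

class DedupStore:
    """Fingerprint-keyed chunk store with per-chunk compression."""

    def __init__(self):
        self.store = {}  # fingerprint (hex digest) -> compressed chunk bytes

    def put_file(self, data: bytes):
        """Store a file; return its recipe (ordered list of fingerprints)."""
        recipe = []
        for chunk in chunk_stream(data):
            fp = hashlib.sha256(chunk).hexdigest()  # chunk fingerprint
            if fp not in self.store:                # chunk-availability check
                self.store[fp] = zlib.compress(chunk)
            recipe.append(fp)
        return recipe

    def get_file(self, recipe):
        """Reassemble a file from its recipe."""
        return b"".join(zlib.decompress(self.store[fp]) for fp in recipe)
```

Because boundaries depend on content rather than fixed offsets, identical regions in two files tend to produce identical chunks even when data before them shifts, so the availability check deduplicates them; storing the same file twice adds no new chunks.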
ISSN:0094-243X
1551-7616
DOI:10.1063/5.0030989