Improving metadata management for small files in HDFS

Scientific applications are adapting HDFS/MapReduce to perform large scale data analytics. One of the major challenges is that an overabundance of small files is common in these applications, and HDFS manages all its files through a single server, the Namenode. It is anticipated that small files can...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Mackey, G., Sehrish, S., Jun Wang
Format:	Tagungsbericht
Sprache:	eng
Schlagworte:	Application software Availability Computer architecture Computer science Data analysis Engineering management File servers File systems Large-scale systems Performance analysis
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Scientific applications are adapting HDFS/MapReduce to perform large scale data analytics. One of the major challenges is that an overabundance of small files is common in these applications, and HDFS manages all its files through a single server, the Namenode. It is anticipated that small files can significantly impact the performance of Namenode. In this work we propose a mechanism to store small files in HDFS efficiently and improve the space utilization for metadata. Our scheme is based on the assumption that each client is assigned a quota in the file system, for both the space and number of files. In our approach, we utilize the compression method `harballing', provided by Hadoop, to better utilize the HDFS. We provide for new job functionality to allow for in-job archival of directories and files so that running MapReduce programs may complete without being killed by the jobtracker due to quota policies. This approach leads to better functionality of metadata operations and more efficient usage of the HDFS. Our analysis results show that we can reduce the metadata footprint in main memory by a factor of 42.
ISSN:	1552-5244 2168-9253
DOI:	10.1109/CLUSTR.2009.5289133