Hadoop on HPC: Integrating Hadoop and Pilot-based Dynamic Resource Management
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | High-performance computing platforms such as supercomputers have
traditionally been designed to meet the compute demands of scientific
applications. Consequently, they have been architected as producers and not
consumers of data. The Apache Hadoop ecosystem has evolved to meet the
requirements of data processing applications and has addressed many of the
limitations of HPC platforms. There exists, however, a class of scientific
applications that needs the collective capabilities of traditional high-performance
computing environments and the Apache Hadoop ecosystem. For example, the
scientific domains of bio-molecular dynamics, genomics and network science need
to couple traditional computing with Hadoop/Spark-based analysis. We
investigate the critical question of how to present the capabilities of both
computing environments to such scientific applications. Whereas this question
needs answers at multiple levels, we focus on the design of resource management
middleware that might support the needs of both. We propose extensions to the
Pilot-Abstraction to provide a unifying resource management layer. This is an
important step that allows applications to integrate HPC stages (e.g.
simulations) with data analytics. Many supercomputing centers have started to
officially support Hadoop environments, either in a dedicated environment or in
hybrid deployments using tools such as myHadoop. This typically involves many
intrinsic, environment-specific details that need to be mastered and that often
swamp conceptual issues such as: How best to couple HPC and Hadoop application
stages? How to explore runtime trade-offs (data locality vs. data movement)?
This paper provides both conceptual understanding and practical solutions to
the integrated use of HPC and Hadoop environments. |
---|---|
DOI: | 10.48550/arxiv.1602.00345 |