Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Data processing frameworks such as Apache Beam and Apache Spark are used for
a wide range of applications, from logs analysis to data preparation for DNN
training. It is thus unsurprising that there has been a large amount of work on
optimizing these frameworks, including their storage management. The shift to
cloud computing requires optimization across all pipelines concurrently running
across a cluster. In this paper, we look at one specific instance of this
problem: placement of I/O-intensive temporary intermediate data on SSD and HDD.
Efficient data placement is challenging since I/O density is usually unknown at
the time data needs to be placed. Additionally, external factors such as load
variability, job preemption, or job priorities can impact job completion times,
which ultimately affect the I/O density of the temporary files in the workload.
In this paper, we envision that machine learning can be used to solve this
problem. We analyze production logs from Google's data centers for a range of
data processing pipelines. Our analysis shows that I/O density may be
predictable. This suggests that learning-based strategies, if crafted
carefully, could extract predictive features for I/O density of temporary files
involved in various transformations, which could be used to improve the
efficiency of storage management in data processing pipelines. |
---|---|
DOI: | 10.48550/arxiv.2211.02286 |