Efficient caching and data access to a remote data lake in a large scale data processing environment

Bibliographic details
Main authors: Ou, Xiongbing; Lee, David; Phelan, Thomas Anthony
Format: Patent
Language: English
Description
Abstract: Embodiments described herein are generally directed to caching and data access improvements in a large scale data processing environment. According to an example, an agent running on a first worker node of a cluster receives a read request from a task. The worker node of the cluster to which the data at issue is mapped is identified. When the first worker node is the identified worker node, it is determined whether its cache contains the data; if not, the data is fetched from a remote data lake and the agent caches the data locally. Otherwise, when the identified worker node is another worker node of the compute cluster, the data is fetched from a remote agent of that worker node. The agent responds to the read request with cached data, data returned by the remote data lake, or data returned by the remote agent, as the case may be.
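
The read path in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Agent`, `owner_of`, `fetch_from_lake`, and `peers` are hypothetical. The hash-modulo mapping of data to worker nodes is an assumption, since the abstract does not say how the mapping is computed. Remote agents are modelled as in-process objects rather than network endpoints.

```python
import hashlib
from typing import Callable, Dict, List


def owner_of(key: str, node_ids: List[str]) -> str:
    """Map a data key to exactly one worker node.

    Assumption: a stable hash modulo the node count. The abstract only
    says that the data is mapped to a worker node, not how.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return node_ids[int.from_bytes(digest[:8], "big") % len(node_ids)]


class Agent:
    """Hypothetical per-node agent that serves read requests from tasks."""

    def __init__(
        self,
        node_id: str,
        node_ids: List[str],
        fetch_from_lake: Callable[[str], bytes],
    ) -> None:
        self.node_id = node_id
        self.node_ids = node_ids
        self.fetch_from_lake = fetch_from_lake  # stand-in for the remote data lake
        self.cache: Dict[str, bytes] = {}       # local cache of this worker node
        self.peers: Dict[str, "Agent"] = {}     # agents on the other worker nodes

    def read(self, key: str) -> bytes:
        owner = owner_of(key, self.node_ids)
        if owner == self.node_id:
            # This node owns the data: serve from the cache when possible.
            if key in self.cache:
                return self.cache[key]
            # Cache miss: fetch from the remote data lake and cache locally.
            data = self.fetch_from_lake(key)
            self.cache[key] = data
            return data
        # Another node owns the data: ask that node's agent. No local copy
        # is kept, so each item is cached on at most one node.
        return self.peers[owner].read(key)
```

In this sketch only the owning node ever contacts the data lake for a given key, so repeated reads of the same key from any node cause a single remote fetch for as long as the entry stays cached.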