Cogset: A Unified Engine for Reliable Storage and Parallel Processing

Bibliographic details
Main authors: Valvag, S.V., Johansen, D.
Format: Conference proceeding
Language: English
Description
Abstract: MapReduce has become a popular paradigm for parallel data processing, both for ad-hoc schema-less processing using a simple functional interface, and as a building block for higher-level abstractions. Much subsequent work has layered additional functionality on top of MapReduce or similar infrastructures, building powerful software stacks for distributed applications. In this paper, we present Cogset, the result of re-thinking the original MapReduce architecture that sits at the bottom of the stack. We observe that the traditional loose coupling between the distributed file system and the MapReduce processing engine leads to poor data locality for many applications. Accordingly, Cogset offers both reliable storage and parallel data processing, fusing the two components into a single system that ensures good data locality. We also take a new approach to data shuffling, relying on highly efficient static routing, and devise new mechanisms for fault tolerance, load balancing and ensuring consistency. We evaluate Cogset using a suite of benchmark applications, comparing it to Hadoop with very favorable results. For example, on a 12-node cluster, an inverted index that takes 80 minutes to build using Hadoop can be constructed using Cogset in less than 35 minutes.
DOI: 10.1109/NPC.2009.23