T2: a customizable parallel database for multi-dimensional data

Bibliographic details
Published in:	SIGMOD record 1998-03, Vol.27 (1), p.58-66
Main authors:	Chang, Chialin; Acharya, Anurag; Sussman, Alan; Saltz, Joel
Format:	Article
Language:	English
Online access:	Full text

Description
Abstract:	As computational power and storage capacity increase, processing and analyzing large volumes of data play an increasingly important part in many domains of scientific research. Typical examples of large scientific datasets include long-running simulations of time-dependent phenomena that periodically generate snapshots of their state (e.g. hydrodynamics and chemical transport simulation for estimating pollution impact on water bodies [4, 6, 20], magnetohydrodynamics simulation of planetary magnetospheres [32], simulation of a flame sweeping through a volume [28], airplane wake simulations [21]), archives of raw and processed remote sensing data (e.g. AVHRR [25], Thematic Mapper [17], MODIS [22]), and archives of medical images (e.g. confocal light microscopy, CT imaging, MRI, sonography). These datasets are usually multi-dimensional. The data dimensions can be spatial coordinates, time, or experimental conditions such as temperature, velocity or magnetic field. The importance of such datasets has been recognized by several database research groups and vendors, and several systems have been developed for managing and/or visualizing them [2, 7, 14, 19, 26, 27, 29, 31]. These systems, however, focus on lineage management, retrieval and visualization of multi-dimensional datasets. They provide little or no support for analyzing or processing these datasets -- the assumption is that this is too application-specific to warrant common support. As a result, applications that process these datasets are usually decoupled from data storage and management, resulting in inefficiency due to copying and loss of locality. Furthermore, every application developer has to implement complex support for managing and scheduling the processing. Over the past three years, we have been working with several scientific research groups to understand the processing requirements for such applications [1, 5, 6, 10, 18, 23, 24, 28].
Our study of a large set of applications indicates that the processing for such datasets is often highly stylized and shares several important characteristics. Usually, both the input dataset as well as the result being computed have underlying multi-dimensional grids, and queries into the dataset are in the form of ranges within each dimension of the grid. The basic processing step usually consists of transforming individual input items, mapping the transformed items to the output grid and computing output items by aggregating, in some way, all the transform
ISSN:	0163-5808
DOI:	10.1145/273244.273264