The Lean Data Scientist: Recent Advances Toward Overcoming the Data Bottleneck

Bibliographic details
Published in: Communications of the ACM 2023-02, Vol. 66 (2), p. 92-102
Main authors: Shani, Chen; Zarecki, Jonathan; Shahaf, Dafna
Format: Magazine article
Language: English
Description
Summary: Shani et al. offer a taxonomy of the methods used to obtain quality datasets and enhance existing resources. Obtaining data has become the key bottleneck in many machine-learning (ML) applications. The rise of deep learning has further exacerbated this issue. Although high-quality ML models are finally making the transition from expensive-to-develop, highly specialized code to something more like a commodity, these models involve millions (or even billions) of parameters and require massive amounts of data to train. Thus, the dominant paradigm in ML today is to create a new (large) dataset whenever facing a novel task. In fact, there are now entire conferences dedicated to the creation of new data resources. While this approach has resulted in significant advances, it suffers from a major caveat: collecting large, high-quality datasets is often very demanding in terms of time and human resources.
ISSN: 0001-0782, 1557-7317
DOI: 10.1145/3551635