The Lean Data Scientist: Recent Advances Toward Overcoming the Data Bottleneck
Published in: | Communications of the ACM 2023-02, Vol.66 (2), p.92-102 |
---|---|
Format: | Magazine article |
Language: | English |
Summary: | Shani et al. offer a taxonomy of methods for obtaining quality datasets and enhancing existing resources. Obtaining data has become the key bottleneck in many machine-learning (ML) applications, and the rise of deep learning has exacerbated the problem. Although high-quality ML models are finally making the transition from expensive-to-develop, highly specialized code to something more like a commodity, these models involve millions (or even billions) of parameters and require massive amounts of data to train. Thus, the dominant paradigm in ML today is to create a new (large) dataset whenever facing a novel task. In fact, there are now entire conferences dedicated to the creation of new data resources. While this approach has produced significant advances, it has a major drawback: collecting large, high-quality datasets is often very demanding in terms of time and human resources. |
---|---|
ISSN: | 0001-0782, 1557-7317 |
DOI: | 10.1145/3551635 |