The Lean Data Scientist: Recent Advances Toward Overcoming the Data Bottleneck

Shani et al. offer a taxonomy of the methods used to obtain quality datasets and enhance existing resources. Obtaining data has become the key bottleneck in many machine-learning (ML) applications. The rise of deep learning has further exacerbated this issue. Although high-quality ML models are finally...

Bibliographic details
Published in: Communications of the ACM 2023-02, Vol.66 (2), p.92-102
Main authors: Shani, Chen; Zarecki, Jonathan; Shahaf, Dafna
Format: Magazine article
Language: English
Online access: Full text
container_end_page 102
container_issue 2
container_start_page 92
container_title Communications of the ACM
container_volume 66
creator Shani, Chen
Zarecki, Jonathan
Shahaf, Dafna
description Shani et al. offer a taxonomy of the methods used to obtain quality datasets and enhance existing resources. Obtaining data has become the key bottleneck in many machine-learning (ML) applications. The rise of deep learning has further exacerbated this issue. Although high-quality ML models are finally making the transition from expensive-to-develop, highly specialized code to something more like a commodity, these models involve millions (or even billions) of parameters and require massive amounts of data to train. Thus, the dominant paradigm in ML today is to create a new (large) dataset whenever facing a novel task. In fact, there are now entire conferences dedicated to the creation of new data resources. While this approach resulted in significant advances, it suffers from a major caveat, as collecting large, high-quality datasets is often very demanding in terms of time and human resources.
doi_str_mv 10.1145/3551635
format Magazinearticle
publisher New York, NY, USA: ACM
rights Copyright Association for Computing Machinery Feb 2023
date 2023-02-01
tpages 11
oa free_for_read
fulltext fulltext
identifier ISSN: 0001-0782
ispartof Communications of the ACM, 2023-02, Vol.66 (2), p.92-102
issn 0001-0782
1557-7317
language eng
recordid cdi_proquest_journals_2776201064
source Alma/SFX Local Collection; EBSCOhost Business Source Complete
subjects Computing methodologies
Data analysis
Datasets
Deep learning
Information systems
Information systems applications
Learning
Machine learning
Machine learning approaches
Taxonomy
title The Lean Data Scientist: Recent Advances Toward Overcoming the Data Bottleneck
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-14T20%3A37%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Lean%20Data%20Scientist:%20Recent%20Advances%20Toward%20Overcoming%20the%20Data%20Bottleneck&rft.jtitle=Communications%20of%20the%20ACM&rft.au=Shani,%20Chen&rft.date=2023-02-01&rft.volume=66&rft.issue=2&rft.spage=92&rft.epage=102&rft.pages=92-102&rft.issn=0001-0782&rft.eissn=1557-7317&rft_id=info:doi/10.1145/3551635&rft_dat=%3Cproquest_cross%3E2776201064%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2776201064&rft_id=info:pmid/&rfr_iscdi=true