ActiveClean: interactive data cleaning for statistical modeling

Analysts often clean dirty data iteratively--cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose Activ...

Full description

Bibliographic details
Published in:	Proceedings of the VLDB Endowment 2016-08, Vol.9 (12), p.948-959
Main authors:	Krishnan, Sanjay; Wang, Jiannan; Wu, Eugene; Franklin, Michael J.; Goldberg, Ken
Format:	Article
Language:	eng
Online access:	Full text

container_end_page	959
container_issue	12
container_start_page	948
container_title	Proceedings of the VLDB Endowment
container_volume	9
creator	Krishnan, Sanjay ; Wang, Jiannan ; Wu, Eugene ; Franklin, Michael J. ; Goldberg, Ken
description	Analysts often clean dirty data iteratively--cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports an important class of models called convex loss models (e.g., linear regression and SVMs), and prioritizes cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, IMDB, and Dollars For Docs) with both real and synthetic errors. The results show that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.
doi_str_mv	10.14778/2994509.2994514
format	Article
fullrecord	<record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_14778_2994509_2994514</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_14778_2994509_2994514</sourcerecordid><originalsourceid>FETCH-LOGICAL-c196t-5de99236be2e5e74524d42df31c85d1f1f2a8a7d4cf1d5b7df3f10686f94270e3</originalsourceid><addsrcrecordid>eNpNj8sKwjAURIMoqNW9P1G9N02aZCnFFwhudB1icgOKLxoR_HtfXbg6AwMzHMZGCGMUSukJN0ZIMOMvUbRYj6OEXINR7b_cZf2UjgClLlH3WDb198ODqhO5y4B1ojslGjbM2G4-21bLfL1ZrKrpOvdoynsuAxnDi3JPnCQpIbkIgodYoNcyYMTInXYqCB8xyL16NxE_f9EIroCKjMFv19fXlGqK9lYfzq5-WgT7lbGNjG1kihdwSDpd</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>ActiveClean: interactive data cleaning for statistical modeling</title><source>ACM Digital Library Complete</source><creator>Krishnan, Sanjay ; Wang, Jiannan ; Wu, Eugene ; Franklin, Michael J. ; Goldberg, Ken</creator><creatorcontrib>Krishnan, Sanjay ; Wang, Jiannan ; Wu, Eugene ; Franklin, Michael J. ; Goldberg, Ken</creatorcontrib><description>Analysts often clean dirty data iteratively--cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports an important class of models called convex loss models (e.g., linear regression and SVMs), and prioritizes cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets UCI Adult, UCI EEG, MNIST, IMDB, and Dollars For Docs with both real and synthetic errors. 
The results show that our proposed optimizations can improve model accuracy by up-to 2.5x for the same amount of data cleaned. Furthermore for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.</description><identifier>ISSN: 2150-8097</identifier><identifier>EISSN: 2150-8097</identifier><identifier>DOI: 10.14778/2994509.2994514</identifier><language>eng</language><ispartof>Proceedings of the VLDB Endowment, 2016-08, Vol.9 (12), p.948-959</ispartof><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c196t-5de99236be2e5e74524d42df31c85d1f1f2a8a7d4cf1d5b7df3f10686f94270e3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27901,27902</link.rule.ids></links><search><creatorcontrib>Krishnan, Sanjay</creatorcontrib><creatorcontrib>Wang, Jiannan</creatorcontrib><creatorcontrib>Wu, Eugene</creatorcontrib><creatorcontrib>Franklin, Michael J.</creatorcontrib><creatorcontrib>Goldberg, Ken</creatorcontrib><title>ActiveClean: interactive data cleaning for statistical modeling</title><title>Proceedings of the VLDB Endowment</title><description>Analysts often clean dirty data iteratively--cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports an important class of models called convex loss models (e.g., linear regression and SVMs), and prioritizes cleaning those records likely to affect the results. 
We evaluate ActiveClean on five real-world datasets UCI Adult, UCI EEG, MNIST, IMDB, and Dollars For Docs with both real and synthetic errors. The results show that our proposed optimizations can improve model accuracy by up-to 2.5x for the same amount of data cleaned. Furthermore for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.</description><issn>2150-8097</issn><issn>2150-8097</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2016</creationdate><recordtype>article</recordtype><recordid>eNpNj8sKwjAURIMoqNW9P1G9N02aZCnFFwhudB1icgOKLxoR_HtfXbg6AwMzHMZGCGMUSukJN0ZIMOMvUbRYj6OEXINR7b_cZf2UjgClLlH3WDb198ODqhO5y4B1ojslGjbM2G4-21bLfL1ZrKrpOvdoynsuAxnDi3JPnCQpIbkIgodYoNcyYMTInXYqCB8xyL16NxE_f9EIroCKjMFv19fXlGqK9lYfzq5-WgT7lbGNjG1kihdwSDpd</recordid><startdate>20160801</startdate><enddate>20160801</enddate><creator>Krishnan, Sanjay</creator><creator>Wang, Jiannan</creator><creator>Wu, Eugene</creator><creator>Franklin, Michael J.</creator><creator>Goldberg, Ken</creator><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20160801</creationdate><title>ActiveClean</title><author>Krishnan, Sanjay ; Wang, Jiannan ; Wu, Eugene ; Franklin, Michael J. 
; Goldberg, Ken</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c196t-5de99236be2e5e74524d42df31c85d1f1f2a8a7d4cf1d5b7df3f10686f94270e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2016</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Krishnan, Sanjay</creatorcontrib><creatorcontrib>Wang, Jiannan</creatorcontrib><creatorcontrib>Wu, Eugene</creatorcontrib><creatorcontrib>Franklin, Michael J.</creatorcontrib><creatorcontrib>Goldberg, Ken</creatorcontrib><collection>CrossRef</collection><jtitle>Proceedings of the VLDB Endowment</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Krishnan, Sanjay</au><au>Wang, Jiannan</au><au>Wu, Eugene</au><au>Franklin, Michael J.</au><au>Goldberg, Ken</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>ActiveClean: interactive data cleaning for statistical modeling</atitle><jtitle>Proceedings of the VLDB Endowment</jtitle><date>2016-08-01</date><risdate>2016</risdate><volume>9</volume><issue>12</issue><spage>948</spage><epage>959</epage><pages>948-959</pages><issn>2150-8097</issn><eissn>2150-8097</eissn><abstract>Analysts often clean dirty data iteratively--cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore the iterative cleaning process in the context of statistical model training, which is an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports an important class of models called convex loss models (e.g., linear regression and SVMs), and prioritizes cleaning those records likely to affect the results. 
We evaluate ActiveClean on five real-world datasets UCI Adult, UCI EEG, MNIST, IMDB, and Dollars For Docs with both real and synthetic errors. The results show that our proposed optimizations can improve model accuracy by up-to 2.5x for the same amount of data cleaned. Furthermore for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.</abstract><doi>10.14778/2994509.2994514</doi><tpages>12</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 2150-8097
ispartof	Proceedings of the VLDB Endowment, 2016-08, Vol.9 (12), p.948-959
issn	2150-8097 2150-8097
language	eng
recordid	cdi_crossref_primary_10_14778_2994509_2994514
source	ACM Digital Library Complete
title	ActiveClean: interactive data cleaning for statistical modeling
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-05T23%3A57%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=ActiveClean:%20interactive%20data%20cleaning%20for%20statistical%20modeling&rft.jtitle=Proceedings%20of%20the%20VLDB%20Endowment&rft.au=Krishnan,%20Sanjay&rft.date=2016-08-01&rft.volume=9&rft.issue=12&rft.spage=948&rft.epage=959&rft.pages=948-959&rft.issn=2150-8097&rft.eissn=2150-8097&rft_id=info:doi/10.14778/2994509.2994514&rft_dat=%3Ccrossref%3E10_14778_2994509_2994514%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true