Progressive Ensemble Learning for in-Sample Data Cleaning

We present an ensemble learning-based data cleaning approach, termed ELDC, capable of identifying and pruning anomalous data. ELDC is distinctive in that an ensemble of base models is trained directly on the noisy in-sample data and dynamically provides clean data during iterative training. Each base model uses a random subset of the target dataset, which may initially contain up to 40% label errors. After each training iteration, anomalous data are separated from clean data by a majority voting scheme, and three types of anomaly (mislabeled, confusing, and outliers) can be identified from a statistical pattern jointly determined by the base models' prediction outputs. By iterating this train-vote-remove cycle, noisy in-sample data are progressively removed until a prespecified condition is reached. Comprehensive experiments, including out-of-sample tests, verify the effectiveness of ELDC in simultaneously suppressing the bias and variance of the prediction output. The ELDC framework is highly flexible: it is not bound to a specific model and allows different transfer-learning configurations. AlexNet, ResNet50, and GoogLeNet are used as base models and trained on various benchmark datasets; the results show that ELDC outperforms state-of-the-art cleaning methods.
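A minimal, hypothetical Python sketch of the train-vote-remove cycle described above. This is not the authors' implementation: the paper trains CNN base models (AlexNet, ResNet50, GoogLeNet) on image benchmarks, while this sketch substitutes decision trees on generic feature data; the function name eldc_clean, the subset fraction, the voting threshold, and the stopping rule are all illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def eldc_clean(X, y, n_models=15, subset_frac=0.6,
               vote_threshold=0.8, max_rounds=10, seed=0):
    """Iteratively drop samples whose labels most base models vote against."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(y))  # indices currently considered clean
    for _ in range(max_rounds):
        Xk, yk = X[keep], y[keep]
        votes = np.zeros(len(keep))
        # Train each base model on a random subset of the current data,
        # then let it vote on every remaining sample's label.
        for _ in range(n_models):
            idx = rng.choice(len(keep), size=int(subset_frac * len(keep)),
                             replace=False)
            model = DecisionTreeClassifier().fit(Xk[idx], yk[idx])
            votes += (model.predict(Xk) != yk)  # a vote against this label
        # Majority voting: flag samples most models disagree with as anomalies.
        flagged = votes / n_models >= vote_threshold
        if not flagged.any():  # prespecified stopping condition reached
            break
        keep = keep[~flagged]  # remove flagged anomalies, then retrain
    return keep

# Usage: clean_idx = eldc_clean(X_train, y_train); X_clean = X_train[clean_idx]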

Bibliographic Details
Published in: IEEE Access, 2024, Vol. 12, pp. 140643-140659
Authors: Wang, Jung-Hua; Lee, Shih-Kai; Wang, Ting-Yuan; Chen, Ming-Jer; Hsu, Shu-Wei
Format: Article
Language: English
Subjects: Cleaning; Complexity theory; Convolutional neural networks; Data analysis; Data cleanliness; Data integrity; Data models; Datasets; Ensemble learning; Image classification; Iterative methods; Neural networks; Noise measurement; Noisy data; Outliers (statistics); Training; Training data; Transfer learning; True labels
Online access: Full text
DOI: 10.1109/ACCESS.2024.3468035
Publisher: IEEE (Piscataway)
ISSN: 2169-3536 (EISSN: 2169-3536)
Sources: IEEE Open Access Journals; DOAJ Directory of Open Access Journals; EZB (freely available e-journals)
URL: https://doi.org/10.1109/ACCESS.2024.3468035