The Importance of Balanced Data Sets for Partial Least Squares Discriminant Analysis: Classification Problems Using Hyperspectral Imaging Data

This study investigates the effect of imbalanced spectral data in the training set, when developing partial least squares discriminant analysis (PLS-DA) classification models for use in future predictions. The experimental study was performed using a real hyperspectral short-wavelength infrared imag...

Bibliographic details
Published in: Journal of near infrared spectroscopy (United Kingdom) 2011-01, Vol.19 (4), p.233-241
Main authors: Lindström, Susanne W., Geladi, Paul, Jonsson, Oskar, Pettersson, Fredrik
Format: Article
Language: English
Subjects: Annan kemi; classification; hyperspectral imaging; obtaining a balanced dataset; Other Chemistry Topics; PLS-DA; unbalanced model
Online access: Full text
container_end_page 241
container_issue 4
container_start_page 233
container_title Journal of near infrared spectroscopy (United Kingdom)
container_volume 19
creator Lindström, Susanne W.
Geladi, Paul
Jonsson, Oskar
Pettersson, Fredrik
description This study investigates the effect of imbalanced spectral data in the training set, when developing partial least squares discriminant analysis (PLS-DA) classification models for use in future predictions. The experimental study was performed using a real hyperspectral short-wavelength infrared image data set collected from bakery products (buns) containing contaminants (flies) but similar applications for other insects, paper and plastic were also tested. The contaminants represent a very small proportion of the images relative to the bun. The PLS-DA model aims at accurately detecting and classifying the contaminants and this requires a modification of the calibration data set. The paper deals with problems caused by unbalanced calibration data sets and how to remedy them. In the example it was demonstrated that, by balancing the calibration data from 58,476 bun pixels + 279 fly pixels to 279 bun + 279 fly pixels, the number of true predictions could be improved with a smaller number of PLS components used in the model. The improvement for flies increased from 65% true predictions with ten PLS components to > 99% true prediction with five to six PLS components. The true prediction for bun went from 100% to 99.5% with six PLS components which is an acceptable reduction. Theoretical explanations are included.
doi_str_mv 10.1255/jnirs.932
format Article
fullrecord <record><control><sourceid>sage_swepu</sourceid><recordid>TN_cdi_swepub_primary_oai_slubar_slu_se_58892</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sage_id>10.1255_jnirs.932</sage_id><sourcerecordid>10.1255_jnirs.932</sourcerecordid><originalsourceid>FETCH-LOGICAL-c335t-def64691cd21962e2e08eb3d0c0ad223a0ec9e925e79063db22a3db942535193</originalsourceid><addsrcrecordid>eNp1kdtqGzEQhkVpoSbJRd9AV4VCN9UhWlu9c-22MRgaiNtbMauddWX2VI2W4JfoM1ebhF61wzAziE-_GP2MvZHiWipjPpz6EOnaavWCLeTSyKI0Rr1kC2HLZSG0Nq_ZFdFJ5FjllHrBfh9-It914xAT9B750PBP0M5jzbeQgN9jIt4Mkd9BTAFavkegxO9_TRCR-DaQj6ELPfSJr3tozxToI9-0QBSa4CGFoed3caha7Ih_p9Af-e15xEgj-hSz4K6D43w6P3fJXjXQEl499wt2-PL5sLkt9t--7jbrfeHzFqmosSlvSit9raQtFSoUK6x0LbyAWikNAr1FqwwurSh1XSkFudobZbSRVl-w4kmWHnCcKjfmFSCe3QDBUTtVEOfmCJ1ZrazK_Pv_8tvwY-2GeHRTNzkjlkJk_N0T7uNAFLH5e0EKNzvlHp1y2anMvn2WhiO60zDF_In0D_APCRKXbQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>The Importance of Balanced Data Sets for Partial Least Squares Discriminant Analysis: Classification Problems Using Hyperspectral Imaging Data</title><source>SAGE Complete A-Z List</source><creator>Lindström, Susanne W. ; Geladi, Paul ; Jonsson, Oskar ; Pettersson, Fredrik</creator><creatorcontrib>Lindström, Susanne W. ; Geladi, Paul ; Jonsson, Oskar ; Pettersson, Fredrik ; Sveriges lantbruksuniversitet</creatorcontrib><description>This study investigates the effect of imbalanced spectral data in the training set, when developing partial least squares discriminant analysis (PLS-DA) classification models for use in future predictions. The experimental study was performed using a real hyperspectral short-wavelength infrared image data set collected from bakery products (buns) containing contaminants (flies) but similar applications for other insects, paper and plastic were also tested. 
The contaminants represent a very small proportion of the images relative to the bun. The PLS-DA model aims at accurately detecting and classifying the contaminants and this requires a modification of the calibration data set. The paper deals with problems caused by unbalanced calibration data sets and how to remedy them. In the example it was demonstrated that, by balancing the calibration data from 58,476 bun pixels + 279 fly pixels to 279 bun + 279 fly pixels, the number of true predictions could be improved with a smaller number of PLS components used in the model. The improvement for flies increased from 65% true predictions with ten PLS components to &gt; 99% true prediction with five to six PLS components. The true prediction for bun went from 100% to 99.5% with six PLS components which is an acceptable reduction. Theoretical explanations are included.</description><identifier>ISSN: 0967-0335</identifier><identifier>ISSN: 1751-6552</identifier><identifier>EISSN: 1751-6552</identifier><identifier>DOI: 10.1255/jnirs.932</identifier><language>eng</language><publisher>London, England: SAGE Publications</publisher><subject>Annan kemi ; classification ; hyperspectral imaging ; obtaining a balanced dataset ; Other Chemistry Topics ; PLS-DA ; unbalanced model</subject><ispartof>Journal of near infrared spectroscopy (United Kingdom), 2011-01, Vol.19 (4), p.233-241</ispartof><rights>2011 Sage 
Publications</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c335t-def64691cd21962e2e08eb3d0c0ad223a0ec9e925e79063db22a3db942535193</citedby><cites>FETCH-LOGICAL-c335t-def64691cd21962e2e08eb3d0c0ad223a0ec9e925e79063db22a3db942535193</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://journals.sagepub.com/doi/pdf/10.1255/jnirs.932$$EPDF$$P50$$Gsage$$H</linktopdf><linktohtml>$$Uhttps://journals.sagepub.com/doi/10.1255/jnirs.932$$EHTML$$P50$$Gsage$$H</linktohtml><link.rule.ids>230,314,777,781,882,21800,27905,27906,43602,43603</link.rule.ids><backlink>$$Uhttps://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-50700$$DView record from Swedish Publication Index$$Hfree_for_read</backlink><backlink>$$Uhttps://res.slu.se/id/publ/58892$$DView record from Swedish Publication Index$$Hfree_for_read</backlink></links><search><creatorcontrib>Lindström, Susanne W.</creatorcontrib><creatorcontrib>Geladi, Paul</creatorcontrib><creatorcontrib>Jonsson, Oskar</creatorcontrib><creatorcontrib>Pettersson, Fredrik</creatorcontrib><creatorcontrib>Sveriges lantbruksuniversitet</creatorcontrib><title>The Importance of Balanced Data Sets for Partial Least Squares Discriminant Analysis: Classification Problems Using Hyperspectral Imaging Data</title><title>Journal of near infrared spectroscopy (United Kingdom)</title><description>This study investigates the effect of imbalanced spectral data in the training set, when developing partial least squares discriminant analysis (PLS-DA) classification models for use in future predictions. The experimental study was performed using a real hyperspectral short-wavelength infrared image data set collected from bakery products (buns) containing contaminants (flies) but similar applications for other insects, paper and plastic were also tested. 
The contaminants represent a very small proportion of the images relative to the bun. The PLS-DA model aims at accurately detecting and classifying the contaminants and this requires a modification of the calibration data set. The paper deals with problems caused by unbalanced calibration data sets and how to remedy them. In the example it was demonstrated that, by balancing the calibration data from 58,476 bun pixels + 279 fly pixels to 279 bun + 279 fly pixels, the number of true predictions could be improved with a smaller number of PLS components used in the model. The improvement for flies increased from 65% true predictions with ten PLS components to &gt; 99% true prediction with five to six PLS components. The true prediction for bun went from 100% to 99.5% with six PLS components which is an acceptable reduction. Theoretical explanations are included.</description><subject>Annan kemi</subject><subject>classification</subject><subject>hyperspectral imaging</subject><subject>obtaining a balanced dataset</subject><subject>Other Chemistry Topics</subject><subject>PLS-DA</subject><subject>unbalanced model</subject><issn>0967-0335</issn><issn>1751-6552</issn><issn>1751-6552</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2011</creationdate><recordtype>article</recordtype><recordid>eNp1kdtqGzEQhkVpoSbJRd9AV4VCN9UhWlu9c-22MRgaiNtbMauddWX2VI2W4JfoM1ebhF61wzAziE-_GP2MvZHiWipjPpz6EOnaavWCLeTSyKI0Rr1kC2HLZSG0Nq_ZFdFJ5FjllHrBfh9-It914xAT9B750PBP0M5jzbeQgN9jIt4Mkd9BTAFavkegxO9_TRCR-DaQj6ELPfSJr3tozxToI9-0QBSa4CGFoed3caha7Ih_p9Af-e15xEgj-hSz4K6D43w6P3fJXjXQEl499wt2-PL5sLkt9t--7jbrfeHzFqmosSlvSit9raQtFSoUK6x0LbyAWikNAr1FqwwurSh1XSkFudobZbSRVl-w4kmWHnCcKjfmFSCe3QDBUTtVEOfmCJ1ZrazK_Pv_8tvwY-2GeHRTNzkjlkJk_N0T7uNAFLH5e0EKNzvlHp1y2anMvn2WhiO60zDF_In0D_APCRKXbQ</recordid><startdate>20110101</startdate><enddate>20110101</enddate><creator>Lindström, Susanne W.</creator><creator>Geladi, Paul</creator><creator>Jonsson, Oskar</creator><creator>Pettersson, 
Fredrik</creator><general>SAGE Publications</general><scope>AAYXX</scope><scope>CITATION</scope><scope>ADTPV</scope><scope>AOWAS</scope><scope>D93</scope></search><sort><creationdate>20110101</creationdate><title>The Importance of Balanced Data Sets for Partial Least Squares Discriminant Analysis: Classification Problems Using Hyperspectral Imaging Data</title><author>Lindström, Susanne W. ; Geladi, Paul ; Jonsson, Oskar ; Pettersson, Fredrik</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c335t-def64691cd21962e2e08eb3d0c0ad223a0ec9e925e79063db22a3db942535193</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2011</creationdate><topic>Annan kemi</topic><topic>classification</topic><topic>hyperspectral imaging</topic><topic>obtaining a balanced dataset</topic><topic>Other Chemistry Topics</topic><topic>PLS-DA</topic><topic>unbalanced model</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Lindström, Susanne W.</creatorcontrib><creatorcontrib>Geladi, Paul</creatorcontrib><creatorcontrib>Jonsson, Oskar</creatorcontrib><creatorcontrib>Pettersson, Fredrik</creatorcontrib><creatorcontrib>Sveriges lantbruksuniversitet</creatorcontrib><collection>CrossRef</collection><collection>SwePub</collection><collection>SwePub Articles</collection><collection>SWEPUB Umeå universitet</collection><jtitle>Journal of near infrared spectroscopy (United Kingdom)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Lindström, Susanne W.</au><au>Geladi, Paul</au><au>Jonsson, Oskar</au><au>Pettersson, Fredrik</au><aucorp>Sveriges lantbruksuniversitet</aucorp><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The Importance of Balanced Data Sets for Partial Least Squares Discriminant Analysis: Classification Problems Using Hyperspectral Imaging 
Data</atitle><jtitle>Journal of near infrared spectroscopy (United Kingdom)</jtitle><date>2011-01-01</date><risdate>2011</risdate><volume>19</volume><issue>4</issue><spage>233</spage><epage>241</epage><pages>233-241</pages><issn>0967-0335</issn><issn>1751-6552</issn><eissn>1751-6552</eissn><abstract>This study investigates the effect of imbalanced spectral data in the training set, when developing partial least squares discriminant analysis (PLS-DA) classification models for use in future predictions. The experimental study was performed using a real hyperspectral short-wavelength infrared image data set collected from bakery products (buns) containing contaminants (flies) but similar applications for other insects, paper and plastic were also tested. The contaminants represent a very small proportion of the images relative to the bun. The PLS-DA model aims at accurately detecting and classifying the contaminants and this requires a modification of the calibration data set. The paper deals with problems caused by unbalanced calibration data sets and how to remedy them. In the example it was demonstrated that, by balancing the calibration data from 58,476 bun pixels + 279 fly pixels to 279 bun + 279 fly pixels, the number of true predictions could be improved with a smaller number of PLS components used in the model. The improvement for flies increased from 65% true predictions with ten PLS components to &gt; 99% true prediction with five to six PLS components. The true prediction for bun went from 100% to 99.5% with six PLS components which is an acceptable reduction. Theoretical explanations are included.</abstract><cop>London, England</cop><pub>SAGE Publications</pub><doi>10.1255/jnirs.932</doi><tpages>9</tpages></addata></record>
fulltext fulltext
identifier ISSN: 0967-0335
ispartof Journal of near infrared spectroscopy (United Kingdom), 2011-01, Vol.19 (4), p.233-241
issn 0967-0335
1751-6552
1751-6552
language eng
recordid cdi_swepub_primary_oai_slubar_slu_se_58892
source SAGE Complete A-Z List
subjects Annan kemi
classification
hyperspectral imaging
obtaining a balanced dataset
Other Chemistry Topics
PLS-DA
unbalanced model
title The Importance of Balanced Data Sets for Partial Least Squares Discriminant Analysis: Classification Problems Using Hyperspectral Imaging Data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T03%3A09%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-sage_swepu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Importance%20of%20Balanced%20Data%20Sets%20for%20Partial%20Least%20Squares%20Discriminant%20Analysis:%20Classification%20Problems%20Using%20Hyperspectral%20Imaging%20Data&rft.jtitle=Journal%20of%20near%20infrared%20spectroscopy%20(United%20Kingdom)&rft.au=Lindstr%C3%B6m,%20Susanne%20W.&rft.aucorp=Sveriges%20lantbruksuniversitet&rft.date=2011-01-01&rft.volume=19&rft.issue=4&rft.spage=233&rft.epage=241&rft.pages=233-241&rft.issn=0967-0335&rft.eissn=1751-6552&rft_id=info:doi/10.1255/jnirs.932&rft_dat=%3Csage_swepu%3E10.1255_jnirs.932%3C/sage_swepu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_sage_id=10.1255_jnirs.932&rfr_iscdi=true
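
The balancing step summarised in the description field (reducing 58,476 bun pixels + 279 fly pixels to 279 bun + 279 fly pixels) can be illustrated with a minimal sketch. The record does not say how the 279 bun pixels were selected, so random undersampling of the majority class is an assumption here, and the function name `balance_by_undersampling` and the `seed` parameter are hypothetical. Only the class sizes come from the abstract; no spectra or PLS-DA model are involved.

```python
import random


def balance_by_undersampling(labels, seed=0):
    """Return sorted indices that cut every class down to the smallest class size.

    Assumption: pixels are dropped at random. The abstract only reports the
    class sizes before and after balancing, not the selection rule.
    """
    rng = random.Random(seed)

    # Group pixel indices by their class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # The smallest class sets the number of pixels kept per class.
    n_keep = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


# Class sizes quoted in the abstract: 58,476 bun pixels and 279 fly pixels.
labels = ["bun"] * 58476 + ["fly"] * 279
kept = balance_by_undersampling(labels)

counts = {c: sum(1 for i in kept if labels[i] == c) for c in ("bun", "fly")}
print(counts)  # {'bun': 279, 'fly': 279}
```

The returned indices would then select the rows of the spectral matrix used for calibration. A fixed seed keeps the subset reproducible, which matters when comparing models with different numbers of PLS components.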