High-Quality Train Data Generation for Deep Learning-Based Web Page Classification Models
Current deep learning models for detecting relevant web pages show low accuracy because of poor-quality training data. In this paper, we propose a novel algorithm that automatically generates high-quality training data based on the frequency of documents that include the entity of interest....
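Published in: | IEEE access 2021, Vol.9, p.85240-85254 |
---|---|
Main authors: | Kim, Jeong-Jae ; On, Byung-Won ; Lee, Ingyu |
Format: | Article |
Language: | eng |
Subjects: | Algorithms ; automatic labelling ; Complexity theory ; Data mining ; Data models ; Deep learning ; Machine learning ; Text classification ; Training ; Training data ; Web pages ; Websites |
Online access: | Full text |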
container_end_page | 85254 |
---|---|
container_issue | |
container_start_page | 85240 |
container_title | IEEE access |
container_volume | 9 |
creator | Kim, Jeong-Jae ; On, Byung-Won ; Lee, Ingyu |
description | Current deep learning models for detecting relevant web pages show low accuracy because of poor-quality training data. In this paper, we propose a novel algorithm that automatically generates high-quality training data based on the frequency of documents that include the entity of interest. Our experimental results on movie and cellphone data sets show that deep learning models (FNN, CNN, Bi-LSTM, and SeqGAN) trained on data generated by the proposed algorithm achieve an average F_{1}-score of up to 0.9992. |
doi_str_mv | 10.1109/ACCESS.2021.3086586 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2542501754</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9446993</ieee_id><doaj_id>oai_doaj_org_article_fba0dcfad7c7436e978539ce34a73a80</doaj_id><sourcerecordid>2542501754</sourcerecordid><originalsourceid>FETCH-LOGICAL-c408t-b69e64ff8502e7ee7090b07206f07f4bd996f0a1f9d4d8c0b4e55b52f26d9b173</originalsourceid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2542501754</pqid></control>
<display><type>article</type><title>High-Quality Train Data Generation for Deep Learning-Based Web Page Classification Models</title><source>IEEE Open Access Journals</source><source>DOAJ Directory of Open Access Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><creator>Kim, Jeong-Jae ; On, Byung-Won ; Lee, Ingyu</creator><creatorcontrib>Kim, Jeong-Jae ; On, Byung-Won ; Lee, Ingyu</creatorcontrib><description><![CDATA[The current deep learning models detecting relevant web pages show low accuracy because of the poor quality of the training data. In this paper, we propose a novel algorithm to automatically generate high-quality training data based on the frequency of the document including the entity of interest. Our experimental results with movies and cellphones data sets show that the average <inline-formula> <tex-math notation="LaTeX">F_{1} </tex-math></inline-formula>-score of the deep learning models (FNN, CNN, Bi-LSTM, and SeqGAN) trained with our proposed algorithm shows up to 0.9992 in <inline-formula> <tex-math notation="LaTeX">F_{1} </tex-math></inline-formula>-score.]]></description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2021.3086586</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Algorithms ; automatic labelling ; Complexity theory ; Data mining ; Data models ; Deep learning ; Machine learning ; Text classification ; Training ; Training data ; Web pages ; Websites</subject><ispartof>IEEE access, 2021, Vol.9, p.85240-85254</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c408t-b69e64ff8502e7ee7090b07206f07f4bd996f0a1f9d4d8c0b4e55b52f26d9b173</citedby><cites>FETCH-LOGICAL-c408t-b69e64ff8502e7ee7090b07206f07f4bd996f0a1f9d4d8c0b4e55b52f26d9b173</cites><orcidid>0000-0002-9629-6816 ; 0000-0001-6929-3188</orcidid></display>
<links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9446993$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,864,2100,4022,27632,27922,27923,27924,54932</link.rule.ids></links>
<search><creatorcontrib>Kim, Jeong-Jae</creatorcontrib><creatorcontrib>On, Byung-Won</creatorcontrib><creatorcontrib>Lee, Ingyu</creatorcontrib><addtitle>Access</addtitle><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><sourceid>DOA</sourceid><startdate>2021</startdate><enddate>2021</enddate><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-9629-6816</orcidid><orcidid>https://orcid.org/0000-0001-6929-3188</orcidid></search>
<sort><creationdate>2021</creationdate><author>Kim, Jeong-Jae ; On, Byung-Won ; Lee, Ingyu</author></sort>
<facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c408t-b69e64ff8502e7ee7090b07206f07f4bd996f0a1f9d4d8c0b4e55b52f26d9b173</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>IEEE access</jtitle></facets>
<delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery>
<addata><au>Kim, Jeong-Jae</au><au>On, Byung-Won</au><au>Lee, Ingyu</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2021</date><risdate>2021</risdate><volume>9</volume><spage>85240</spage><epage>85254</epage><pages>85240-85254</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2021.3086586</doi><tpages>15</tpages><orcidid>https://orcid.org/0000-0002-9629-6816</orcidid><orcidid>https://orcid.org/0000-0001-6929-3188</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2169-3536 |
ispartof | IEEE access, 2021, Vol.9, p.85240-85254 |
issn | 2169-3536 2169-3536 |
language | eng |
recordid | cdi_proquest_journals_2542501754 |
source | IEEE Open Access Journals; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals |
subjects | Algorithms ; automatic labelling ; Complexity theory ; Data mining ; Data models ; Deep learning ; Machine learning ; Text classification ; Training ; Training data ; Web pages ; Websites |
title | High-Quality Train Data Generation for Deep Learning-Based Web Page Classification Models |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T18%3A47%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=High-Quality%20Train%20Data%20Generation%20for%20Deep%20Learning-Based%20Web%20Page%20Classification%20Models&rft.jtitle=IEEE%20access&rft.au=Kim,%20Jeong-Jae&rft.date=2021&rft.volume=9&rft.spage=85240&rft.epage=85254&rft.pages=85240-85254&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2021.3086586&rft_dat=%3Cproquest_cross%3E2542501754%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2542501754&rft_id=info:pmid/&rft_ieee_id=9446993&rft_doaj_id=oai_doaj_org_article_fba0dcfad7c7436e978539ce34a73a80&rfr_iscdi=true |
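
The description in this record reports its headline result as an F_{1}-score. For readers unfamiliar with the metric, the sketch below shows the standard definition: the harmonic mean of precision and recall, computed here from raw counts. The function name `f1_score` and the example counts are illustrative assumptions only; they are not taken from the paper and do not reproduce its experiments.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall).

    tp, fp, fn are counts of true positives, false positives and
    false negatives. The names are hypothetical, chosen for this sketch.
    """
    if tp == 0:
        # No true positives: precision and recall are both zero
        # (or undefined), so F1 is reported as 0 by convention.
        return 0.0
    precision = tp / (tp + fp)  # share of predicted-relevant pages that are relevant
    recall = tp / (tp + fn)     # share of relevant pages that were found
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration, not results from the paper.
example = f1_score(tp=90, fp=10, fn=10)
print(f"F1 = {example:.4f}")
```

Because F1 penalizes an imbalance between precision and recall, a value close to 1, such as the 0.9992 quoted above, requires both to be close to 1 at the same time.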