Three Methods for Occupation Coding Based on Statistical Learning

Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combinin...

Bibliographic details
Published in:	Journal of official statistics 2017-03, Vol.33 (1), p.101-122
Main authors:	Gweon, Hyukjun; Schonlau, Matthias; Kaczmirek, Lars; Blohm, Michael; Steiner, Stefan
Format:	Article
Language:	English
Subjects:	ALLBUS; Artificial intelligence; Automated coding; Coding standards; ISCO-88; Machine learning; Occupations; Statistical analysis; Statistical data
Online access:	Full text

container_end_page	122
container_issue	1
container_start_page	101
container_title	Journal of official statistics
container_volume	33
creator	Gweon, Hyukjun; Schonlau, Matthias; Kaczmirek, Lars; Blohm, Michael; Steiner, Stefan
description	Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.
doi_str_mv	10.1515/jos-2017-0006
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_1873744424</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sage_id>10.1515_jos-2017-0006</sage_id><sourcerecordid>4318177921</sourcerecordid><originalsourceid>FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</originalsourceid><addsrcrecordid>eNqFkEtLAzEUhYMoWB9L9wF3QjSvmWRwVYsvqHRhBXchndy0U-qkJjNI_70p48KF4Oq-zncuHIQuGL1mBStu1iERTpkilNLyAI04pYwoUapDNKJccyK5eD9GJymtKRWV4GyExvNVBMAv0K2CS9iHiGd13W9t14QWT4Jr2iW-swkczvNrl_epa2q7wVOwsc3XM3Tk7SbB-U89RW8P9_PJE5nOHp8n4ylZFFx0RIEuFpzK0nMqahC1swKYWyiQFTjBrfaV085yVdg8VaoEoSvhuVfSF5qLU3Q5-G5j-OwhdWYd-tjml4ZpJZSUksusIoOqjiGlCN5sY_Nh484wavYpZSqZfUpmn1LW3w76L7vpIDpYxn6Xm1_mf3KCMcoyfTXQyS7hP-QbOFp6kg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1873744424</pqid></control><display><type>article</type><title>Three Methods for Occupation Coding Based on Statistical Learning</title><source>De Gruyter Open Access Journals</source><source>Sage Journals GOLD Open Access 2024</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Sociological Abstracts</source><creator>Gweon, Hyukjun ; Schonlau, Matthias ; Kaczmirek, Lars ; Blohm, Michael ; Steiner, Stefan</creator><creatorcontrib>Gweon, Hyukjun ; Schonlau, Matthias ; Kaczmirek, Lars ; Blohm, Michael ; Steiner, Stefan</creatorcontrib><description>Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. 
We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.</description><identifier>ISSN: 0282-423X</identifier><identifier>ISSN: 2001-7367</identifier><identifier>EISSN: 2001-7367</identifier><identifier>DOI: 10.1515/jos-2017-0006</identifier><language>eng</language><publisher>London, England: SAGE Publications</publisher><subject>ALLBUS ; Artificial intelligence ; Automated coding ; Coding standards ; ISCO-88 ; Machine learning ; Occupations ; Statistical analysis ; Statistical data</subject><ispartof>Journal of official statistics, 2017-03, Vol.33 (1), p.101-122</ispartof><rights>by Hyukjun Gweon</rights><rights>Copyright Statistics Sweden (SCB) 
2017</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</citedby><cites>FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://journals.sagepub.com/doi/pdf/10.1515/jos-2017-0006$$EPDF$$P50$$Gsage$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://journals.sagepub.com/doi/10.1515/jos-2017-0006$$EHTML$$P50$$Gsage$$Hfree_for_read</linktohtml><link.rule.ids>314,778,782,21955,27333,27842,27913,27914,33763,44934,45322,66917,68701</link.rule.ids></links><search><creatorcontrib>Gweon, Hyukjun</creatorcontrib><creatorcontrib>Schonlau, Matthias</creatorcontrib><creatorcontrib>Kaczmirek, Lars</creatorcontrib><creatorcontrib>Blohm, Michael</creatorcontrib><creatorcontrib>Steiner, Stefan</creatorcontrib><title>Three Methods for Occupation Coding Based on Statistical Learning</title><title>Journal of official statistics</title><description>Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. 
Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.</description><subject>ALLBUS</subject><subject>Artificial intelligence</subject><subject>Automated coding</subject><subject>Coding standards</subject><subject>ISCO-88</subject><subject>Machine learning</subject><subject>Occupations</subject><subject>Statistical analysis</subject><subject>Statistical data</subject><issn>0282-423X</issn><issn>2001-7367</issn><issn>2001-7367</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><sourceid>AFRWT</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>BHHNA</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNqFkEtLAzEUhYMoWB9L9wF3QjSvmWRwVYsvqHRhBXchndy0U-qkJjNI_70p48KF4Oq-zncuHIQuGL1mBStu1iERTpkilNLyAI04pYwoUapDNKJccyK5eD9GJymtKRWV4GyExvNVBMAv0K2CS9iHiGd13W9t14QWT4Jr2iW-swkczvNrl_epa2q7wVOwsc3XM3Tk7SbB-U89RW8P9_PJE5nOHp8n4ylZFFx0RIEuFpzK0nMqahC1swKYWyiQFTjBrfaV085yVdg8VaoEoSvhuVfSF5qLU3Q5-G5j-OwhdWYd-tjml4ZpJZSUksusIoOqjiGlCN5sY_Nh484wavYpZSqZfUpmn1LW3w76L7vpIDpYxn6Xm1_mf3KCMcoyfTXQyS7hP-QbOFp6kg</recordid><startdate>20170301</startdate><enddate>20170301</enddate><creator>Gweon, Hyukjun</creator><creator>Schonlau, Matthias</creator><creator>Kaczmirek, Lars</creator><creator>Blohm, Michael</creator><creator>Steiner, Stefan</creator><general>SAGE Publications</general><general>De Gruyter Open</general><general>Statistics Sweden 
(SCB)</general><scope>AFRWT</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>0-V</scope><scope>3V.</scope><scope>7U4</scope><scope>7XB</scope><scope>88J</scope><scope>8C1</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ALSLI</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BFMQW</scope><scope>BGLVJ</scope><scope>BHHNA</scope><scope>CCPQU</scope><scope>DWI</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>HEHIP</scope><scope>L6V</scope><scope>M2R</scope><scope>M2S</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>Q9U</scope><scope>WZK</scope></search><sort><creationdate>20170301</creationdate><title>Three Methods for Occupation Coding Based on Statistical Learning</title><author>Gweon, Hyukjun ; Schonlau, Matthias ; Kaczmirek, Lars ; Blohm, Michael ; Steiner, Stefan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>ALLBUS</topic><topic>Artificial intelligence</topic><topic>Automated coding</topic><topic>Coding standards</topic><topic>ISCO-88</topic><topic>Machine learning</topic><topic>Occupations</topic><topic>Statistical analysis</topic><topic>Statistical data</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Gweon, Hyukjun</creatorcontrib><creatorcontrib>Schonlau, Matthias</creatorcontrib><creatorcontrib>Kaczmirek, Lars</creatorcontrib><creatorcontrib>Blohm, Michael</creatorcontrib><creatorcontrib>Steiner, Stefan</creatorcontrib><collection>Sage Journals GOLD Open Access 
2024</collection><collection>CrossRef</collection><collection>ProQuest Social Sciences Premium Collection</collection><collection>ProQuest Central (Corporate)</collection><collection>Sociological Abstracts (pre-2017)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Social Science Database (Alumni Edition)</collection><collection>Public Health Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Social Science Premium Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Continental Europe Database</collection><collection>Technology Collection</collection><collection>Sociological Abstracts</collection><collection>ProQuest One Community College</collection><collection>Sociological Abstracts</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>Sociology Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Social Science Database</collection><collection>Sociology Database</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering 
Collection</collection><collection>ProQuest Central Basic</collection><collection>Sociological Abstracts (Ovid)</collection><jtitle>Journal of official statistics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Gweon, Hyukjun</au><au>Schonlau, Matthias</au><au>Kaczmirek, Lars</au><au>Blohm, Michael</au><au>Steiner, Stefan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Three Methods for Occupation Coding Based on Statistical Learning</atitle><jtitle>Journal of official statistics</jtitle><date>2017-03-01</date><risdate>2017</risdate><volume>33</volume><issue>1</issue><spage>101</spage><epage>122</epage><pages>101-122</pages><issn>0282-423X</issn><issn>2001-7367</issn><eissn>2001-7367</eissn><abstract>Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.</abstract><cop>London, England</cop><pub>SAGE Publications</pub><doi>10.1515/jos-2017-0006</doi><tpages>22</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0282-423X
ispartof	Journal of official statistics, 2017-03, Vol.33 (1), p.101-122
issn	0282-423X 2001-7367 2001-7367
language	eng
recordid	cdi_proquest_journals_1873744424
source	De Gruyter Open Access Journals; Sage Journals GOLD Open Access 2024; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Sociological Abstracts
subjects	ALLBUS; Artificial intelligence; Automated coding; Coding standards; ISCO-88; Machine learning; Occupations; Statistical analysis; Statistical data
title	Three Methods for Occupation Coding Based on Statistical Learning
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T10%3A06%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Three%20Methods%20for%20Occupation%20Coding%20Based%20on%20Statistical%20Learning&rft.jtitle=Journal%20of%20official%20statistics&rft.au=Gweon,%20Hyukjun&rft.date=2017-03-01&rft.volume=33&rft.issue=1&rft.spage=101&rft.epage=122&rft.pages=101-122&rft.issn=0282-423X&rft.eissn=2001-7367&rft_id=info:doi/10.1515/jos-2017-0006&rft_dat=%3Cproquest_cross%3E4318177921%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1873744424&rft_id=info:pmid/&rft_sage_id=10.1515_jos-2017-0006&rfr_iscdi=true