Three Methods for Occupation Coding Based on Statistical Learning

Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combinin...

Bibliographic Details
Published in: Journal of official statistics 2017-03, Vol.33 (1), p.101-122
Main authors: Gweon, Hyukjun; Schonlau, Matthias; Kaczmirek, Lars; Blohm, Michael; Steiner, Stefan
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 122
container_issue 1
container_start_page 101
container_title Journal of official statistics
container_volume 33
creator Gweon, Hyukjun
Schonlau, Matthias
Kaczmirek, Lars
Blohm, Michael
Steiner, Stefan
description Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.
doi_str_mv 10.1515/jos-2017-0006
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_1873744424</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sage_id>10.1515_jos-2017-0006</sage_id><sourcerecordid>4318177921</sourcerecordid><originalsourceid>FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</originalsourceid><addsrcrecordid>eNqFkEtLAzEUhYMoWB9L9wF3QjSvmWRwVYsvqHRhBXchndy0U-qkJjNI_70p48KF4Oq-zncuHIQuGL1mBStu1iERTpkilNLyAI04pYwoUapDNKJccyK5eD9GJymtKRWV4GyExvNVBMAv0K2CS9iHiGd13W9t14QWT4Jr2iW-swkczvNrl_epa2q7wVOwsc3XM3Tk7SbB-U89RW8P9_PJE5nOHp8n4ylZFFx0RIEuFpzK0nMqahC1swKYWyiQFTjBrfaV085yVdg8VaoEoSvhuVfSF5qLU3Q5-G5j-OwhdWYd-tjml4ZpJZSUksusIoOqjiGlCN5sY_Nh484wavYpZSqZfUpmn1LW3w76L7vpIDpYxn6Xm1_mf3KCMcoyfTXQyS7hP-QbOFp6kg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1873744424</pqid></control><display><type>article</type><title>Three Methods for Occupation Coding Based on Statistical Learning</title><source>De Gruyter Open Access Journals</source><source>Sage Journals GOLD Open Access 2024</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Sociological Abstracts</source><creator>Gweon, Hyukjun ; Schonlau, Matthias ; Kaczmirek, Lars ; Blohm, Michael ; Steiner, Stefan</creator><creatorcontrib>Gweon, Hyukjun ; Schonlau, Matthias ; Kaczmirek, Lars ; Blohm, Michael ; Steiner, Stefan</creatorcontrib><description>Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. 
We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.</description><identifier>ISSN: 0282-423X</identifier><identifier>ISSN: 2001-7367</identifier><identifier>EISSN: 2001-7367</identifier><identifier>DOI: 10.1515/jos-2017-0006</identifier><language>eng</language><publisher>London, England: SAGE Publications</publisher><subject>ALLBUS ; Artificial intelligence ; Automated coding ; Coding standards ; ISCO-88 ; Machine learning ; Occupations ; Statistical analysis ; Statistical data</subject><ispartof>Journal of official statistics, 2017-03, Vol.33 (1), p.101-122</ispartof><rights>by Hyukjun Gweon</rights><rights>Copyright Statistics Sweden (SCB) 
2017</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</citedby><cites>FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://journals.sagepub.com/doi/pdf/10.1515/jos-2017-0006$$EPDF$$P50$$Gsage$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://journals.sagepub.com/doi/10.1515/jos-2017-0006$$EHTML$$P50$$Gsage$$Hfree_for_read</linktohtml><link.rule.ids>314,778,782,21955,27333,27842,27913,27914,33763,44934,45322,66917,68701</link.rule.ids></links><search><creatorcontrib>Gweon, Hyukjun</creatorcontrib><creatorcontrib>Schonlau, Matthias</creatorcontrib><creatorcontrib>Kaczmirek, Lars</creatorcontrib><creatorcontrib>Blohm, Michael</creatorcontrib><creatorcontrib>Steiner, Stefan</creatorcontrib><title>Three Methods for Occupation Coding Based on Statistical Learning</title><title>Journal of official statistics</title><description>Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. 
Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.</description><subject>ALLBUS</subject><subject>Artificial intelligence</subject><subject>Automated coding</subject><subject>Coding standards</subject><subject>ISCO-88</subject><subject>Machine learning</subject><subject>Occupations</subject><subject>Statistical analysis</subject><subject>Statistical data</subject><issn>0282-423X</issn><issn>2001-7367</issn><issn>2001-7367</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><sourceid>AFRWT</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>BHHNA</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNqFkEtLAzEUhYMoWB9L9wF3QjSvmWRwVYsvqHRhBXchndy0U-qkJjNI_70p48KF4Oq-zncuHIQuGL1mBStu1iERTpkilNLyAI04pYwoUapDNKJccyK5eD9GJymtKRWV4GyExvNVBMAv0K2CS9iHiGd13W9t14QWT4Jr2iW-swkczvNrl_epa2q7wVOwsc3XM3Tk7SbB-U89RW8P9_PJE5nOHp8n4ylZFFx0RIEuFpzK0nMqahC1swKYWyiQFTjBrfaV085yVdg8VaoEoSvhuVfSF5qLU3Q5-G5j-OwhdWYd-tjml4ZpJZSUksusIoOqjiGlCN5sY_Nh484wavYpZSqZfUpmn1LW3w76L7vpIDpYxn6Xm1_mf3KCMcoyfTXQyS7hP-QbOFp6kg</recordid><startdate>20170301</startdate><enddate>20170301</enddate><creator>Gweon, Hyukjun</creator><creator>Schonlau, Matthias</creator><creator>Kaczmirek, Lars</creator><creator>Blohm, Michael</creator><creator>Steiner, Stefan</creator><general>SAGE Publications</general><general>De Gruyter Open</general><general>Statistics Sweden 
(SCB)</general><scope>AFRWT</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>0-V</scope><scope>3V.</scope><scope>7U4</scope><scope>7XB</scope><scope>88J</scope><scope>8C1</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ALSLI</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BFMQW</scope><scope>BGLVJ</scope><scope>BHHNA</scope><scope>CCPQU</scope><scope>DWI</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>HEHIP</scope><scope>L6V</scope><scope>M2R</scope><scope>M2S</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>Q9U</scope><scope>WZK</scope></search><sort><creationdate>20170301</creationdate><title>Three Methods for Occupation Coding Based on Statistical Learning</title><author>Gweon, Hyukjun ; Schonlau, Matthias ; Kaczmirek, Lars ; Blohm, Michael ; Steiner, Stefan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>ALLBUS</topic><topic>Artificial intelligence</topic><topic>Automated coding</topic><topic>Coding standards</topic><topic>ISCO-88</topic><topic>Machine learning</topic><topic>Occupations</topic><topic>Statistical analysis</topic><topic>Statistical data</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Gweon, Hyukjun</creatorcontrib><creatorcontrib>Schonlau, Matthias</creatorcontrib><creatorcontrib>Kaczmirek, Lars</creatorcontrib><creatorcontrib>Blohm, Michael</creatorcontrib><creatorcontrib>Steiner, Stefan</creatorcontrib><collection>Sage Journals GOLD Open Access 
2024</collection><collection>CrossRef</collection><collection>ProQuest Social Sciences Premium Collection</collection><collection>ProQuest Central (Corporate)</collection><collection>Sociological Abstracts (pre-2017)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Social Science Database (Alumni Edition)</collection><collection>Public Health Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Social Science Premium Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Continental Europe Database</collection><collection>Technology Collection</collection><collection>Sociological Abstracts</collection><collection>ProQuest One Community College</collection><collection>Sociological Abstracts</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>Sociology Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Social Science Database</collection><collection>Sociology Database</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering 
Collection</collection><collection>ProQuest Central Basic</collection><collection>Sociological Abstracts (Ovid)</collection><jtitle>Journal of official statistics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Gweon, Hyukjun</au><au>Schonlau, Matthias</au><au>Kaczmirek, Lars</au><au>Blohm, Michael</au><au>Steiner, Stefan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Three Methods for Occupation Coding Based on Statistical Learning</atitle><jtitle>Journal of official statistics</jtitle><date>2017-03-01</date><risdate>2017</risdate><volume>33</volume><issue>1</issue><spage>101</spage><epage>122</epage><pages>101-122</pages><issn>0282-423X</issn><issn>2001-7367</issn><eissn>2001-7367</eissn><abstract>Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.</abstract><cop>London, England</cop><pub>SAGE Publications</pub><doi>10.1515/jos-2017-0006</doi><tpages>22</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0282-423X
ispartof Journal of official statistics, 2017-03, Vol.33 (1), p.101-122
issn 0282-423X
2001-7367
2001-7367
language eng
recordid cdi_proquest_journals_1873744424
source De Gruyter Open Access Journals; Sage Journals GOLD Open Access 2024; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Sociological Abstracts
subjects ALLBUS
Artificial intelligence
Automated coding
Coding standards
ISCO-88
Machine learning
Occupations
Statistical analysis
Statistical data
title Three Methods for Occupation Coding Based on Statistical Learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T10%3A06%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Three%20Methods%20for%20Occupation%20Coding%20Based%20on%20Statistical%20Learning&rft.jtitle=Journal%20of%20official%20statistics&rft.au=Gweon,%20Hyukjun&rft.date=2017-03-01&rft.volume=33&rft.issue=1&rft.spage=101&rft.epage=122&rft.pages=101-122&rft.issn=0282-423X&rft.eissn=2001-7367&rft_id=info:doi/10.1515/jos-2017-0006&rft_dat=%3Cproquest_cross%3E4318177921%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1873744424&rft_id=info:pmid/&rft_sage_id=10.1515_jos-2017-0006&rfr_iscdi=true
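
The description above mentions a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and reports that defining duplicates on ngram variables is preferable to exact string matches. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that two answers count as duplicates when they share the same set of lowercased word n-grams, that the most frequent code among duplicates is used, and that any learner can be plugged in as the fallback. The names `ngram_key`, `HybridCoder`, and `fallback`, the sample answers, and the codes "A", "B", and "UNCODED" are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def ngram_key(text, n=1):
    """Reduce an answer to its set of lowercased word n-grams.

    Assumption: answers with the same n-gram set are treated as duplicates,
    regardless of word order, case, or punctuation.
    """
    tokens = re.findall(r"\w+", text.lower())
    return frozenset(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class HybridCoder:
    """Use the codes of duplicates when they exist, else a fallback learner."""

    def __init__(self, fallback, key=ngram_key):
        self.fallback = fallback          # hypothetical stand-in for a trained model
        self.key = key                    # function that defines what a duplicate is
        self.codes_by_key = defaultdict(Counter)

    def fit(self, answers, codes):
        # Count how often each code was assigned to each duplicate group.
        for answer, code in zip(answers, codes):
            self.codes_by_key[self.key(answer)][code] += 1
        return self

    def predict(self, answer):
        votes = self.codes_by_key.get(self.key(answer))
        if votes:
            # Duplicates exist: take the most frequent code among them.
            return votes.most_common(1)[0][0]
        # No duplicate in the training data: defer to the learner.
        return self.fallback(answer)


# Hypothetical training answers with made-up codes.
train_answers = ["Primary school teacher", "teacher, primary school", "Truck driver"]
train_codes = ["A", "A", "B"]

coder = HybridCoder(fallback=lambda answer: "UNCODED").fit(train_answers, train_codes)

print(coder.predict("School teacher (primary)"))  # matched through its n-gram set
print(coder.predict("nurse"))                     # no duplicate, handled by fallback
```

An exact string match would miss "School teacher (primary)" because no training answer has that spelling, while the n-gram key groups it with the two teacher answers. Passing `key=str` to `HybridCoder` would give the exact-match variant for comparison.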