Three Methods for Occupation Coding Based on Statistical Learning
Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combinin...
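Published in: | Journal of official statistics 2017-03, Vol.33 (1), p.101-122 |
---|---|
Main authors: | Gweon, Hyukjun; Schonlau, Matthias; Kaczmirek, Lars; Blohm, Michael; Steiner, Stefan |
Format: | Article |
Language: | eng |
Subjects: | ALLBUS; Artificial intelligence; Automated coding; Coding standards; ISCO-88; Machine learning; Occupations; Statistical analysis; Statistical data |
Online access: | Full text |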
container_end_page | 122 |
---|---|
container_issue | 1 |
container_start_page | 101 |
container_title | Journal of official statistics |
container_volume | 33 |
creator | Gweon, Hyukjun; Schonlau, Matthias; Kaczmirek, Lars; Blohm, Michael; Steiner, Stefan |
description | Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches. |
doi_str_mv | 10.1515/jos-2017-0006 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_1873744424</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sage_id>10.1515_jos-2017-0006</sage_id><sourcerecordid>4318177921</sourcerecordid><originalsourceid>FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</originalsourceid><addsrcrecordid>eNqFkEtLAzEUhYMoWB9L9wF3QjSvmWRwVYsvqHRhBXchndy0U-qkJjNI_70p48KF4Oq-zncuHIQuGL1mBStu1iERTpkilNLyAI04pYwoUapDNKJccyK5eD9GJymtKRWV4GyExvNVBMAv0K2CS9iHiGd13W9t14QWT4Jr2iW-swkczvNrl_epa2q7wVOwsc3XM3Tk7SbB-U89RW8P9_PJE5nOHp8n4ylZFFx0RIEuFpzK0nMqahC1swKYWyiQFTjBrfaV085yVdg8VaoEoSvhuVfSF5qLU3Q5-G5j-OwhdWYd-tjml4ZpJZSUksusIoOqjiGlCN5sY_Nh484wavYpZSqZfUpmn1LW3w76L7vpIDpYxn6Xm1_mf3KCMcoyfTXQyS7hP-QbOFp6kg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1873744424</pqid></control><display><type>article</type><title>Three Methods for Occupation Coding Based on Statistical Learning</title><source>De Gruyter Open Access Journals</source><source>Sage Journals GOLD Open Access 2024</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Sociological Abstracts</source><creator>Gweon, Hyukjun ; Schonlau, Matthias ; Kaczmirek, Lars ; Blohm, Michael ; Steiner, Stefan</creator><creatorcontrib>Gweon, Hyukjun ; Schonlau, Matthias ; Kaczmirek, Lars ; Blohm, Michael ; Steiner, Stefan</creatorcontrib><description>Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. 
We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.</description><identifier>ISSN: 0282-423X</identifier><identifier>ISSN: 2001-7367</identifier><identifier>EISSN: 2001-7367</identifier><identifier>DOI: 10.1515/jos-2017-0006</identifier><language>eng</language><publisher>London, England: SAGE Publications</publisher><subject>ALLBUS ; Artificial intelligence ; Automated coding ; Coding standards ; ISCO-88 ; Machine learning ; Occupations ; Statistical analysis ; Statistical data</subject><ispartof>Journal of official statistics, 2017-03, Vol.33 (1), p.101-122</ispartof><rights>by Hyukjun Gweon</rights><rights>Copyright Statistics Sweden (SCB) 
2017</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</citedby><cites>FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://journals.sagepub.com/doi/pdf/10.1515/jos-2017-0006$$EPDF$$P50$$Gsage$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://journals.sagepub.com/doi/10.1515/jos-2017-0006$$EHTML$$P50$$Gsage$$Hfree_for_read</linktohtml><link.rule.ids>314,778,782,21955,27333,27842,27913,27914,33763,44934,45322,66917,68701</link.rule.ids></links><search><creatorcontrib>Gweon, Hyukjun</creatorcontrib><creatorcontrib>Schonlau, Matthias</creatorcontrib><creatorcontrib>Kaczmirek, Lars</creatorcontrib><creatorcontrib>Blohm, Michael</creatorcontrib><creatorcontrib>Steiner, Stefan</creatorcontrib><title>Three Methods for Occupation Coding Based on Statistical Learning</title><title>Journal of official statistics</title><description>Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. 
Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.</description><subject>ALLBUS</subject><subject>Artificial intelligence</subject><subject>Automated coding</subject><subject>Coding standards</subject><subject>ISCO-88</subject><subject>Machine learning</subject><subject>Occupations</subject><subject>Statistical analysis</subject><subject>Statistical data</subject><issn>0282-423X</issn><issn>2001-7367</issn><issn>2001-7367</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><sourceid>AFRWT</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>BHHNA</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNqFkEtLAzEUhYMoWB9L9wF3QjSvmWRwVYsvqHRhBXchndy0U-qkJjNI_70p48KF4Oq-zncuHIQuGL1mBStu1iERTpkilNLyAI04pYwoUapDNKJccyK5eD9GJymtKRWV4GyExvNVBMAv0K2CS9iHiGd13W9t14QWT4Jr2iW-swkczvNrl_epa2q7wVOwsc3XM3Tk7SbB-U89RW8P9_PJE5nOHp8n4ylZFFx0RIEuFpzK0nMqahC1swKYWyiQFTjBrfaV085yVdg8VaoEoSvhuVfSF5qLU3Q5-G5j-OwhdWYd-tjml4ZpJZSUksusIoOqjiGlCN5sY_Nh484wavYpZSqZfUpmn1LW3w76L7vpIDpYxn6Xm1_mf3KCMcoyfTXQyS7hP-QbOFp6kg</recordid><startdate>20170301</startdate><enddate>20170301</enddate><creator>Gweon, Hyukjun</creator><creator>Schonlau, Matthias</creator><creator>Kaczmirek, Lars</creator><creator>Blohm, Michael</creator><creator>Steiner, Stefan</creator><general>SAGE Publications</general><general>De Gruyter Open</general><general>Statistics Sweden 
(SCB)</general><scope>AFRWT</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>0-V</scope><scope>3V.</scope><scope>7U4</scope><scope>7XB</scope><scope>88J</scope><scope>8C1</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ALSLI</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BFMQW</scope><scope>BGLVJ</scope><scope>BHHNA</scope><scope>CCPQU</scope><scope>DWI</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>HEHIP</scope><scope>L6V</scope><scope>M2R</scope><scope>M2S</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>Q9U</scope><scope>WZK</scope></search><sort><creationdate>20170301</creationdate><title>Three Methods for Occupation Coding Based on Statistical Learning</title><author>Gweon, Hyukjun ; Schonlau, Matthias ; Kaczmirek, Lars ; Blohm, Michael ; Steiner, Stefan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-b523t-7e85b2046f203ce3cda3e1db7e49ed32a8f9d8da275a32a976e3893f2f74f5823</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>ALLBUS</topic><topic>Artificial intelligence</topic><topic>Automated coding</topic><topic>Coding standards</topic><topic>ISCO-88</topic><topic>Machine learning</topic><topic>Occupations</topic><topic>Statistical analysis</topic><topic>Statistical data</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Gweon, Hyukjun</creatorcontrib><creatorcontrib>Schonlau, Matthias</creatorcontrib><creatorcontrib>Kaczmirek, Lars</creatorcontrib><creatorcontrib>Blohm, Michael</creatorcontrib><creatorcontrib>Steiner, Stefan</creatorcontrib><collection>Sage Journals GOLD Open Access 
2024</collection><collection>CrossRef</collection><collection>ProQuest Social Sciences Premium Collection</collection><collection>ProQuest Central (Corporate)</collection><collection>Sociological Abstracts (pre-2017)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Social Science Database (Alumni Edition)</collection><collection>Public Health Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Social Science Premium Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Continental Europe Database</collection><collection>Technology Collection</collection><collection>Sociological Abstracts</collection><collection>ProQuest One Community College</collection><collection>Sociological Abstracts</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>Sociology Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Social Science Database</collection><collection>Sociology Database</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering 
Collection</collection><collection>ProQuest Central Basic</collection><collection>Sociological Abstracts (Ovid)</collection><jtitle>Journal of official statistics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Gweon, Hyukjun</au><au>Schonlau, Matthias</au><au>Kaczmirek, Lars</au><au>Blohm, Michael</au><au>Steiner, Stefan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Three Methods for Occupation Coding Based on Statistical Learning</atitle><jtitle>Journal of official statistics</jtitle><date>2017-03-01</date><risdate>2017</risdate><volume>33</volume><issue>1</issue><spage>101</spage><epage>122</epage><pages>101-122</pages><issn>0282-423X</issn><issn>2001-7367</issn><eissn>2001-7367</eissn><abstract>Occupation coding, an important task in official statistics, refers to coding a respondent’s text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed occupation codes and for aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find defining duplicates based on ngram variables (a concept from text mining) is preferable to one based on exact string matches.</abstract><cop>London, England</cop><pub>SAGE Publications</pub><doi>10.1515/jos-2017-0006</doi><tpages>22</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0282-423X |
ispartof | Journal of official statistics, 2017-03, Vol.33 (1), p.101-122 |
issn | 0282-423X; 2001-7367 |
language | eng |
recordid | cdi_proquest_journals_1873744424 |
source | De Gruyter Open Access Journals; Sage Journals GOLD Open Access 2024; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Sociological Abstracts |
subjects | ALLBUS; Artificial intelligence; Automated coding; Coding standards; ISCO-88; Machine learning; Occupations; Statistical analysis; Statistical data |
title | Three Methods for Occupation Coding Based on Statistical Learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T10%3A06%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Three%20Methods%20for%20Occupation%20Coding%20Based%20on%20Statistical%20Learning&rft.jtitle=Journal%20of%20official%20statistics&rft.au=Gweon,%20Hyukjun&rft.date=2017-03-01&rft.volume=33&rft.issue=1&rft.spage=101&rft.epage=122&rft.pages=101-122&rft.issn=0282-423X&rft.eissn=2001-7367&rft_id=info:doi/10.1515/jos-2017-0006&rft_dat=%3Cproquest_cross%3E4318177921%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1873744424&rft_id=info:pmid/&rft_sage_id=10.1515_jos-2017-0006&rfr_iscdi=true |
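
The abstract contrasts two ways of defining a "duplicate" answer: exact string matches and matches on n-gram variables. The following Python sketch illustrates that difference only; it is not the authors' implementation. The tokenization (lowercasing, word unigrams, order ignored), the function names, and the placeholder labels `code_A` and `code_B` are all assumptions made for illustration, and the sample answers are invented.

```python
import re
from collections import Counter, defaultdict


def exact_key(answer):
    """Duplicate key for exact string matching: the raw answer itself."""
    return answer


def ngram_key(answer, n=1):
    """Duplicate key built from word n-grams.

    Assumption: answers are lowercased, split on non-word characters,
    and compared as an unordered set of n-grams.
    """
    tokens = re.findall(r"\w+", answer.lower())
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return frozenset(grams)


def build_lookup(answers, codes, key_fn):
    """Count how often each code was assigned to each duplicate key."""
    table = defaultdict(Counter)
    for answer, code in zip(answers, codes):
        table[key_fn(answer)][code] += 1
    return table


def code_by_duplicate(answer, table, key_fn):
    """Return the most frequent code among duplicates, or None if none exist."""
    counts = table.get(key_fn(answer))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented training answers with placeholder codes (not real ISCO-88 codes).
train_answers = ["Lehrer, Gymnasium", "Krankenschwester"]
train_codes = ["code_A", "code_B"]

exact_table = build_lookup(train_answers, train_codes, exact_key)
ngram_table = build_lookup(train_answers, train_codes, ngram_key)

new_answer = "gymnasium lehrer"
print(code_by_duplicate(new_answer, exact_table, exact_key))
print(code_by_duplicate(new_answer, ngram_table, ngram_key))
```

Under these assumptions, the new answer has no exact-string duplicate but does have an n-gram duplicate, because case, punctuation, and word order no longer matter. In a hybrid scheme of the kind the abstract describes, a `None` result would be the point at which a statistical learning model takes over; how the authors actually combine the two is not specified in this record.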