How Robust Are Multirater Interrater Reliability Indices to Changes in Frequency Distribution?

Interrater reliability studies are used in a diverse set of fields. Often, these investigations involve three or more raters, and thus, require the use of indices such as Fleiss's kappa, Conger's kappa, or Krippendorff's alpha. Through two motivating examples-one theoretical and one f...

Bibliographic details
Published in: The American statistician 2016-11, Vol.70 (4), p.373-384
Main authors: Quarfoot, David, Levine, Richard A.
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 384
container_issue 4
container_start_page 373
container_title The American statistician
container_volume 70
creator Quarfoot, David
Levine, Richard A.
description Interrater reliability studies are used in a diverse set of fields. Often, these investigations involve three or more raters, and thus, require the use of indices such as Fleiss's kappa, Conger's kappa, or Krippendorff's alpha. Through two motivating examples (one theoretical and one from practice), this article exposes limitations of these indices when the units to be rated are not well-distributed across the rating categories. Then, using a Monte Carlo simulation and information visualizations, we argue for the use of two alternative indices, the Brennan-Prediger coefficient and Gwet's AC2, because the agreement levels reported by these indices are more robust to variation in the distribution of units that raters encounter. The article concludes by exploring the complex, interwoven relationship between the number of levels in a rating instrument, the agreement level present among raters, and the distribution of units that are to be scored. Supplementary materials for this article are available online.
doi_str_mv 10.1080/00031305.2016.1141708
format Article
fullrecord <record><control><sourceid>jstor_cross</sourceid><recordid>TN_cdi_jstor_primary_45118396</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><jstor_id>45118396</jstor_id><sourcerecordid>45118396</sourcerecordid><originalsourceid>FETCH-LOGICAL-c407t-1d47756c48ed2ed76d1436ee4147868210f15d07d123333ceac7c4ca8594e1bb3</originalsourceid><addsrcrecordid>eNp9UF1LwzAUDaLgnP6EQcHnztwmbbonHdO5wUQY-mpI01QzumYmKaP_3pROH70P9_Ocey8HoQngKeAc32GMCRCcThMM2RSAAsP5GRpBSlicMALnaNRj4h50ia6c24USsywZoY-VOUZbU7TOR3Orope29toKr2y0boIf0q2qtSh0rX0X2qWWykXeRIsv0XyGVDfR0qrvVjWyix6181YXrdemub9GF5Wonbo5xTF6Xz69LVbx5vV5vZhvYkkx8zGUlLE0kzRXZaJKlpVASaYUBcryLE8AV5CWmJWQkGBSCckklSJPZ1RBUZAxuh32HqwJfzjPd6a1TTjJk5RSQikDFlDpgJLWOGdVxQ9W74XtOGDeS8l_peS9lPwkZeBNBt7OeWP_SDQFyMksC_OHYa6byti9OBpbl9yLrja2sqKR2nHy_4kf1lWDog</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2544344717</pqid></control><display><type>article</type><title>How Robust Are Multirater Interrater Reliability Indices to Changes in Frequency Distribution?</title><source>JSTOR Mathematics &amp; Statistics</source><source>JSTOR Archive Collection A-Z Listing</source><creator>Quarfoot, David ; Levine, Richard A.</creator><creatorcontrib>Quarfoot, David ; Levine, Richard A.</creatorcontrib><description>Interrater reliability studies are used in a diverse set of fields. Often, these investigations involve three or more raters, and thus, require the use of indices such as Fleiss's kappa, Conger's kappa, or Krippendorff's alpha. Through two motivating examples-one theoretical and one from practice-this article exposes limitations of these indices when the units to be rated are not well-distributed across the rating categories. 
Then, using a Monte Carlo simulation and information visualizations, we argue for the use of two alternative indices, the Brennan-Prediger coefficient and Gwet's AC2, because the agreement levels reported by these indices are more robust to variation in the distribution of units that raters encounter. The article concludes by exploring the complex, interwoven relationship between the number of levels in a rating instrument, the agreement level present among raters, and the distribution of units that are to be scored. Supplementary materials for this article are available online.</description><identifier>ISSN: 0003-1305</identifier><identifier>EISSN: 1537-2731</identifier><identifier>DOI: 10.1080/00031305.2016.1141708</identifier><language>eng</language><publisher>Alexandria: Taylor &amp; Francis</publisher><subject>Agreement ; Conger ; Fleiss ; Frequency distribution ; Gwet ; Krippendorff ; Monte Carlo simulation ; Paradox ; Regression analysis ; Reliability ; Reliability analysis ; Robustness ; Statistical methods ; STATISTICAL PRACTICE ; Statistics</subject><ispartof>The American statistician, 2016-11, Vol.70 (4), p.373-384</ispartof><rights>2016 American Statistical Association 2016</rights><rights>Copyright 2016 American Statistical Association</rights><rights>2016 American Statistical 
Association</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c407t-1d47756c48ed2ed76d1436ee4147868210f15d07d123333ceac7c4ca8594e1bb3</citedby><cites>FETCH-LOGICAL-c407t-1d47756c48ed2ed76d1436ee4147868210f15d07d123333ceac7c4ca8594e1bb3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.jstor.org/stable/pdf/45118396$$EPDF$$P50$$Gjstor$$H</linktopdf><linktohtml>$$Uhttps://www.jstor.org/stable/45118396$$EHTML$$P50$$Gjstor$$H</linktohtml><link.rule.ids>314,780,784,803,832,27924,27925,58017,58021,58250,58254</link.rule.ids></links><search><creatorcontrib>Quarfoot, David</creatorcontrib><creatorcontrib>Levine, Richard A.</creatorcontrib><title>How Robust Are Multirater Interrater Reliability Indices to Changes in Frequency Distribution?</title><title>The American statistician</title><description>Interrater reliability studies are used in a diverse set of fields. Often, these investigations involve three or more raters, and thus, require the use of indices such as Fleiss's kappa, Conger's kappa, or Krippendorff's alpha. Through two motivating examples-one theoretical and one from practice-this article exposes limitations of these indices when the units to be rated are not well-distributed across the rating categories. Then, using a Monte Carlo simulation and information visualizations, we argue for the use of two alternative indices, the Brennan-Prediger coefficient and Gwet's AC2, because the agreement levels reported by these indices are more robust to variation in the distribution of units that raters encounter. The article concludes by exploring the complex, interwoven relationship between the number of levels in a rating instrument, the agreement level present among raters, and the distribution of units that are to be scored. 
Supplementary materials for this article are available online.</description><subject>Agreement</subject><subject>Conger</subject><subject>Fleiss</subject><subject>Frequency distribution</subject><subject>Gwet</subject><subject>Krippendorff</subject><subject>Monte Carlo simulation</subject><subject>Paradox</subject><subject>Regression analysis</subject><subject>Reliability</subject><subject>Reliability analysis</subject><subject>Robustness</subject><subject>Statistical methods</subject><subject>STATISTICAL PRACTICE</subject><subject>Statistics</subject><issn>0003-1305</issn><issn>1537-2731</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2016</creationdate><recordtype>article</recordtype><recordid>eNp9UF1LwzAUDaLgnP6EQcHnztwmbbonHdO5wUQY-mpI01QzumYmKaP_3pROH70P9_Ocey8HoQngKeAc32GMCRCcThMM2RSAAsP5GRpBSlicMALnaNRj4h50ia6c24USsywZoY-VOUZbU7TOR3Orope29toKr2y0boIf0q2qtSh0rX0X2qWWykXeRIsv0XyGVDfR0qrvVjWyix6181YXrdemub9GF5Wonbo5xTF6Xz69LVbx5vV5vZhvYkkx8zGUlLE0kzRXZaJKlpVASaYUBcryLE8AV5CWmJWQkGBSCckklSJPZ1RBUZAxuh32HqwJfzjPd6a1TTjJk5RSQikDFlDpgJLWOGdVxQ9W74XtOGDeS8l_peS9lPwkZeBNBt7OeWP_SDQFyMksC_OHYa6byti9OBpbl9yLrja2sqKR2nHy_4kf1lWDog</recordid><startdate>20161101</startdate><enddate>20161101</enddate><creator>Quarfoot, David</creator><creator>Levine, Richard A.</creator><general>Taylor &amp; Francis</general><general>American Statistical Association</general><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20161101</creationdate><title>How Robust Are Multirater Interrater Reliability Indices to Changes in Frequency Distribution?</title><author>Quarfoot, David ; Levine, Richard 
A.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c407t-1d47756c48ed2ed76d1436ee4147868210f15d07d123333ceac7c4ca8594e1bb3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2016</creationdate><topic>Agreement</topic><topic>Conger</topic><topic>Fleiss</topic><topic>Frequency distribution</topic><topic>Gwet</topic><topic>Krippendorff</topic><topic>Monte Carlo simulation</topic><topic>Paradox</topic><topic>Regression analysis</topic><topic>Reliability</topic><topic>Reliability analysis</topic><topic>Robustness</topic><topic>Statistical methods</topic><topic>STATISTICAL PRACTICE</topic><topic>Statistics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Quarfoot, David</creatorcontrib><creatorcontrib>Levine, Richard A.</creatorcontrib><collection>CrossRef</collection><jtitle>The American statistician</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Quarfoot, David</au><au>Levine, Richard A.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>How Robust Are Multirater Interrater Reliability Indices to Changes in Frequency Distribution?</atitle><jtitle>The American statistician</jtitle><date>2016-11-01</date><risdate>2016</risdate><volume>70</volume><issue>4</issue><spage>373</spage><epage>384</epage><pages>373-384</pages><issn>0003-1305</issn><eissn>1537-2731</eissn><abstract>Interrater reliability studies are used in a diverse set of fields. Often, these investigations involve three or more raters, and thus, require the use of indices such as Fleiss's kappa, Conger's kappa, or Krippendorff's alpha. Through two motivating examples-one theoretical and one from practice-this article exposes limitations of these indices when the units to be rated are not well-distributed across the rating categories. 
Then, using a Monte Carlo simulation and information visualizations, we argue for the use of two alternative indices, the Brennan-Prediger coefficient and Gwet's AC2, because the agreement levels reported by these indices are more robust to variation in the distribution of units that raters encounter. The article concludes by exploring the complex, interwoven relationship between the number of levels in a rating instrument, the agreement level present among raters, and the distribution of units that are to be scored. Supplementary materials for this article are available online.</abstract><cop>Alexandria</cop><pub>Taylor &amp; Francis</pub><doi>10.1080/00031305.2016.1141708</doi><tpages>12</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0003-1305
ispartof The American statistician, 2016-11, Vol.70 (4), p.373-384
issn 0003-1305
1537-2731
language eng
recordid cdi_jstor_primary_45118396
source JSTOR Mathematics & Statistics; JSTOR Archive Collection A-Z Listing
subjects Agreement
Conger
Fleiss
Frequency distribution
Gwet
Krippendorff
Monte Carlo simulation
Paradox
Regression analysis
Reliability
Reliability analysis
Robustness
Statistical methods
STATISTICAL PRACTICE
Statistics
title How Robust Are Multirater Interrater Reliability Indices to Changes in Frequency Distribution?
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T12%3A28%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=How%20Robust%20Are%20Multirater%20Interrater%20Reliability%20Indices%20to%20Changes%20in%20Frequency%20Distribution?&rft.jtitle=The%20American%20statistician&rft.au=Quarfoot,%20David&rft.date=2016-11-01&rft.volume=70&rft.issue=4&rft.spage=373&rft.epage=384&rft.pages=373-384&rft.issn=0003-1305&rft.eissn=1537-2731&rft_id=info:doi/10.1080/00031305.2016.1141708&rft_dat=%3Cjstor_cross%3E45118396%3C/jstor_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2544344717&rft_id=info:pmid/&rft_jstor_id=45118396&rfr_iscdi=true
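
The description in this record contrasts Fleiss's kappa with the Brennan-Prediger coefficient when the rated units are unevenly distributed across categories. A minimal Python sketch of that contrast follows. It uses the commonly stated textbook formulas for the two indices on unweighted nominal ratings, not code or data from the article; the function names and the toy ratings table are hypothetical and chosen only for illustration.

```python
def observed_agreement(counts):
    """Mean pairwise agreement across units.

    counts[i][j] is the number of raters who put unit i in category j;
    every row is assumed to sum to the same number of raters r >= 2.
    """
    r = sum(counts[0])
    per_unit = [sum(n * (n - 1) for n in row) / (r * (r - 1)) for row in counts]
    return sum(per_unit) / len(counts)


def fleiss_kappa(counts):
    """Fleiss's kappa: chance agreement comes from the pooled category shares."""
    r = sum(counts[0])
    n_units = len(counts)
    n_categories = len(counts[0])
    p_a = observed_agreement(counts)
    shares = [sum(row[j] for row in counts) / (n_units * r)
              for j in range(n_categories)]
    p_e = sum(p * p for p in shares)
    return (p_a - p_e) / (1 - p_e)


def brennan_prediger(counts):
    """Brennan-Prediger coefficient: chance agreement is fixed at 1/q."""
    n_categories = len(counts[0])
    p_a = observed_agreement(counts)
    p_e = 1 / n_categories
    return (p_a - p_e) / (1 - p_e)


# Hypothetical toy data: 3 raters, 2 categories, 10 units.
# Nine units are placed unanimously in category 0; one unit splits 2 vs. 1.
skewed = [[3, 0]] * 9 + [[2, 1]]

print(round(observed_agreement(skewed), 3))
print(round(fleiss_kappa(skewed), 3))
print(round(brennan_prediger(skewed), 3))
```

On this toy table the raters agree on about 93% of pairwise comparisons, yet Fleiss's kappa comes out slightly negative (about -0.034) because the pooled category shares make chance agreement almost as high as the observed agreement. The Brennan-Prediger value is about 0.867, since its chance term depends only on the number of categories. This is a single constructed case and not a substitute for the Monte Carlo study the record describes.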