Hypothesis testing procedure for binary and multi‐class F1‐scores in the paired design

In modern medicine, medical tests are used for various purposes, including diagnosis, disease screening, prognosis, and risk prediction. The performance of a binary medical test is often quantified by sensitivity, specificity, and negative and positive predictive values. Additionally...

Bibliographic details
Published in: Statistics in medicine 2023-10, Vol.42 (23), p.4177-4192
Main authors: Takahashi, Kanae; Yamamoto, Kouji; Kuchiba, Aya; Shintani, Ayumi; Koyama, Tatsuki
Format: Article
Language: eng
Subjects: Hypotheses; Hypothesis testing; Medical screening
Online access: Full text
container_end_page 4192
container_issue 23
container_start_page 4177
container_title Statistics in medicine
container_volume 42
creator Takahashi, Kanae
Yamamoto, Kouji
Kuchiba, Aya
Shintani, Ayumi
Koyama, Tatsuki
description In modern medicine, medical tests are used for various purposes, including diagnosis, disease screening, prognosis, and risk prediction. The performance of a binary medical test is often quantified by sensitivity, specificity, and negative and positive predictive values. Additionally, the $F_1$‐score, defined as the harmonic mean of precision (positive predictive value) and recall (sensitivity), has come into use in the medical field because of its favorable characteristics. The $F_1$‐score has been extended to multi‐class classification, for which two variants have been proposed: a micro‐averaged $F_1$‐score and a macro‐averaged $F_1$‐score. The micro‐averaged $F_1$‐score pools per‐sample classifications across classes and then calculates the overall $F_1$‐score, whereas the macro‐averaged $F_1$‐score is the arithmetic mean of the per‐class $F_1$‐scores. Sokolova and Lapalme$^1$ gave an alternative definition of the macro‐averaged $F_1$‐score as the harmonic mean of the arithmetic means of the precision and recall over classes. Although some statistical methods of inference for binary and multi‐class $F_1$‐scores have been proposed, hypothesis testing procedures for them have not yet been fully developed. Therefore, we aim to develop a hypothesis testing procedure for comparing two $F_1$‐scores in a paired study design, based on the large‐sample multivariate central limit theorem.
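The definitions in the abstract can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the article: the function names (`binary_f1`, `multiclass_f1`), the confusion-matrix layout (rows are true classes, columns are predicted classes), and the convention of returning 0 when a denominator is zero are all assumptions made for this example. It computes point estimates only and does not implement the paper's hypothesis test.

```python
def _safe_div(num, den):
    # Assumed convention for this sketch: an undefined ratio is reported as 0.0.
    return num / den if den else 0.0


def _harmonic_mean(p, r):
    return _safe_div(2 * p * r, p + r)


def binary_f1(tp, fp, fn):
    """F1 for one class: harmonic mean of precision and recall."""
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    return _harmonic_mean(precision, recall)


def multiclass_f1(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    k = len(cm)
    tp = [cm[i][i] for i in range(k)]
    fp = [sum(cm[r][i] for r in range(k)) - cm[i][i] for i in range(k)]
    fn = [sum(cm[i]) - cm[i][i] for i in range(k)]

    precision = [_safe_div(tp[i], tp[i] + fp[i]) for i in range(k)]
    recall = [_safe_div(tp[i], tp[i] + fn[i]) for i in range(k)]

    # Micro-averaged: pool the counts over classes, then compute one F1.
    micro = _harmonic_mean(
        _safe_div(sum(tp), sum(tp) + sum(fp)),
        _safe_div(sum(tp), sum(tp) + sum(fn)),
    )
    # Macro-averaged: arithmetic mean of the per-class F1 scores.
    macro = sum(_harmonic_mean(p, r) for p, r in zip(precision, recall)) / k
    # Alternative macro definition (Sokolova and Lapalme): harmonic mean of
    # the arithmetic means of precision and recall over classes.
    macro_alt = _harmonic_mean(sum(precision) / k, sum(recall) / k)
    return {"micro": micro, "macro": macro, "macro_alt": macro_alt}


if __name__ == "__main__":
    # Hypothetical 3-class confusion matrix, made up for illustration.
    example = [[5, 1, 0], [2, 3, 1], [0, 1, 7]]
    print(multiclass_f1(example))
```

When every sample has exactly one true and one predicted label, the pooled false positives equal the pooled false negatives, so the micro-averaged score coincides with overall accuracy. The two macro definitions generally differ.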
doi_str_mv 10.1002/sim.9853
format Article
publisher New York: Wiley Subscription Services, Inc
pmid 37527903
rights 2023 John Wiley & Sons, Ltd.
date 2023-10-15
tpages 16
oa free_for_read
toplevel peer_reviewed
online_resources
collection ProQuest Health & Medical Complete (Alumni)
MEDLINE - Academic
PubMed Central (Full Participant titles)
fulltext fulltext
identifier ISSN: 0277-6715
ispartof Statistics in medicine, 2023-10, Vol.42 (23), p.4177-4192
issn 0277-6715
1097-0258
1097-0258
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_11483486
source Wiley Online Library Journals Frontfile Complete
subjects Hypotheses
Hypothesis testing
Medical screening
title Hypothesis testing procedure for binary and multi‐class F1‐scores in the paired design
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-31T20%3A50%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Hypothesis%20testing%20procedure%20for%20binary%20and%20multi%E2%80%90class%20F1%E2%80%90scores%20in%20the%20paired%20design&rft.jtitle=Statistics%20in%20medicine&rft.au=Takahashi,%20Kanae&rft.date=2023-10-15&rft.volume=42&rft.issue=23&rft.spage=4177&rft.epage=4192&rft.pages=4177-4192&rft.issn=0277-6715&rft.eissn=1097-0258&rft_id=info:doi/10.1002/sim.9853&rft_dat=%3Cproquest_pubme%3E2845103036%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2865716307&rft_id=info:pmid/37527903&rfr_iscdi=true