greylock: A Python Package for Measuring The Composition of Complex Datasets

Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have be...

Detailed description

Bibliographic details
Published in: ArXiv.org 2023-12
Main authors: Nguyen, Phuc; Arora, Rohit; Hill, Elliot D; Braun, Jasper; Morgan, Alexandra; Quintana, Liza M; Mazzoni, Gabrielle; Lee, Ghee Rye; Arnaout, Rima; Arnaout, Ramy
Format: Article
Language: eng
Online access: Full text
container_end_page
container_issue
container_start_page
container_title ArXiv.org
container_volume
creator Nguyen, Phuc
Arora, Rohit
Hill, Elliot D
Braun, Jasper
Morgan, Alexandra
Quintana, Liza M
Mazzoni, Gabrielle
Lee, Ghee Rye
Arnaout, Rima
Arnaout, Ramy
description Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed greylock, a Python package that calculates diversity measures and is tailored to large datasets. greylock can calculate any of the frequency-sensitive measures of Hill's D-number framework, and going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). greylock also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe greylock's key features and usage. We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating greylock's applicability across a range of dataset types and fields.
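The frequency-sensitive and similarity-sensitive measures named in the description can be illustrated with a minimal sketch. This is not greylock's API: the function names `hill_number` and `similarity_sensitive` are hypothetical, and the similarity-sensitive form shown is the Leinster-Cobbold formulation, assumed here as one common way to extend Hill numbers with a similarity matrix. Only the Python standard library is used.

```python
import math

def hill_number(counts, q):
    """Hill number of order q: the effective number of classes.

    counts: per-class element counts; q: order (q = 1 is handled as a limit).
    Hypothetical helper for illustration, not part of greylock.
    """
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if q == 1:
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

def similarity_sensitive(counts, sim, q):
    """Similarity-sensitive counterpart (assumed Leinster-Cobbold form).

    sim[i][j] is the similarity of classes i and j, in [0, 1], with sim[i][i] == 1.
    """
    total = sum(counts)
    p = [c / total for c in counts]
    n = len(p)
    # (Zp)_i: the expected similarity between class i and a randomly drawn element.
    zp = [sum(sim[i][j] * p[j] for j in range(n)) for i in range(n)]
    terms = [(pi, z) for pi, z in zip(p, zp) if pi > 0]
    if q == 1:
        return math.exp(-sum(pi * math.log(z) for pi, z in terms))
    return sum(pi * z ** (q - 1) for pi, z in terms) ** (1.0 / (1.0 - q))

balanced = [25, 25, 25, 25]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
all_same = [[1.0] * 4 for _ in range(4)]

d_freq = hill_number(balanced, 2)                       # four equally common classes
d_identity = similarity_sensitive(balanced, identity, 2)  # no similarity between classes
d_all_same = similarity_sensitive(balanced, all_same, 2)  # all classes fully similar
```

With an identity similarity matrix the similarity-sensitive value reduces to the ordinary Hill number, and with fully similar classes it collapses to one effective class, which is the sense in which these measures incorporate between-element similarities.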
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_miscellaneous_3085687048</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3085687048</sourcerecordid><originalsourceid>FETCH-proquest_miscellaneous_30856870483</originalsourceid><addsrcrecordid>eNqVirsKwjAUQIMoWLT_kNGlEJM-gptUxUHBoXsJ5fZh06bmpmD_XhEHV6dzDpwZ8bgQ20CGnM9_fEl8xDtjjMcJjyLhkUtlYdKmaHd0T2-Tq01Pb6poVQW0NJZeQeFom76iWQ00Nd1gsHHN-zLlJzU86UE5heBwTRal0gj-lyuyOR2z9BwM1jxGQJd3DRagterBjJgLJqNYJiyU4o_1BSr2QvI</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3085687048</pqid></control><display><type>article</type><title>greylock: A Python Package for Measuring The Composition of Complex Datasets</title><source>Free E- Journals</source><creator>Nguyen, Phuc ; Arora, Rohit ; Hill, Elliot D ; Braun, Jasper ; Morgan, Alexandra ; Quintana, Liza M ; Mazzoni, Gabrielle ; Lee, Ghee Rye ; Arnaout, Rima ; Arnaout, Ramy</creator><creatorcontrib>Nguyen, Phuc ; Arora, Rohit ; Hill, Elliot D ; Braun, Jasper ; Morgan, Alexandra ; Quintana, Liza M ; Mazzoni, Gabrielle ; Lee, Ghee Rye ; Arnaout, Rima ; Arnaout, Ramy</creatorcontrib><description>Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed greylock, a Python package that calculates diversity measures and is tailored to large datasets. 
greylock can calculate any of the frequency-sensitive measures of Hill's D-number framework, and going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). greylock also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe greylock's key features and usage. 
We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating greylock's applicability across a range of dataset types and fields.</description><identifier>ISSN: 2331-8422</identifier><identifier>EISSN: 2331-8422</identifier><language>eng</language><ispartof>ArXiv.org, 2023-12</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784</link.rule.ids></links><search><creatorcontrib>Nguyen, Phuc</creatorcontrib><creatorcontrib>Arora, Rohit</creatorcontrib><creatorcontrib>Hill, Elliot D</creatorcontrib><creatorcontrib>Braun, Jasper</creatorcontrib><creatorcontrib>Morgan, Alexandra</creatorcontrib><creatorcontrib>Quintana, Liza M</creatorcontrib><creatorcontrib>Mazzoni, Gabrielle</creatorcontrib><creatorcontrib>Lee, Ghee Rye</creatorcontrib><creatorcontrib>Arnaout, Rima</creatorcontrib><creatorcontrib>Arnaout, Ramy</creatorcontrib><title>greylock: A Python Package for Measuring The Composition of Complex Datasets</title><title>ArXiv.org</title><description>Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed greylock, a Python package that calculates diversity measures and is tailored to large datasets. 
greylock can calculate any of the frequency-sensitive measures of Hill's D-number framework, and going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). greylock also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe greylock's key features and usage. 
We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating greylock's applicability across a range of dataset types and fields.</description><issn>2331-8422</issn><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNqVirsKwjAUQIMoWLT_kNGlEJM-gptUxUHBoXsJ5fZh06bmpmD_XhEHV6dzDpwZ8bgQ20CGnM9_fEl8xDtjjMcJjyLhkUtlYdKmaHd0T2-Tq01Pb6poVQW0NJZeQeFom76iWQ00Nd1gsHHN-zLlJzU86UE5heBwTRal0gj-lyuyOR2z9BwM1jxGQJd3DRagterBjJgLJqNYJiyU4o_1BSr2QvI</recordid><startdate>20231229</startdate><enddate>20231229</enddate><creator>Nguyen, Phuc</creator><creator>Arora, Rohit</creator><creator>Hill, Elliot D</creator><creator>Braun, Jasper</creator><creator>Morgan, Alexandra</creator><creator>Quintana, Liza M</creator><creator>Mazzoni, Gabrielle</creator><creator>Lee, Ghee Rye</creator><creator>Arnaout, Rima</creator><creator>Arnaout, Ramy</creator><scope>7X8</scope></search><sort><creationdate>20231229</creationdate><title>greylock: A Python Package for Measuring The Composition of Complex Datasets</title><author>Nguyen, Phuc ; Arora, Rohit ; Hill, Elliot D ; Braun, Jasper ; Morgan, Alexandra ; Quintana, Liza M ; Mazzoni, Gabrielle ; Lee, Ghee Rye ; Arnaout, Rima ; Arnaout, Ramy</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_miscellaneous_30856870483</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><toplevel>online_resources</toplevel><creatorcontrib>Nguyen, Phuc</creatorcontrib><creatorcontrib>Arora, Rohit</creatorcontrib><creatorcontrib>Hill, Elliot D</creatorcontrib><creatorcontrib>Braun, Jasper</creatorcontrib><creatorcontrib>Morgan, Alexandra</creatorcontrib><creatorcontrib>Quintana, Liza M</creatorcontrib><creatorcontrib>Mazzoni, Gabrielle</creatorcontrib><creatorcontrib>Lee, Ghee 
Rye</creatorcontrib><creatorcontrib>Arnaout, Rima</creatorcontrib><creatorcontrib>Arnaout, Ramy</creatorcontrib><collection>MEDLINE - Academic</collection><jtitle>ArXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Nguyen, Phuc</au><au>Arora, Rohit</au><au>Hill, Elliot D</au><au>Braun, Jasper</au><au>Morgan, Alexandra</au><au>Quintana, Liza M</au><au>Mazzoni, Gabrielle</au><au>Lee, Ghee Rye</au><au>Arnaout, Rima</au><au>Arnaout, Ramy</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>greylock: A Python Package for Measuring The Composition of Complex Datasets</atitle><jtitle>ArXiv.org</jtitle><date>2023-12-29</date><risdate>2023</risdate><issn>2331-8422</issn><eissn>2331-8422</eissn><abstract>Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed greylock, a Python package that calculates diversity measures and is tailored to large datasets. greylock can calculate any of the frequency-sensitive measures of Hill's D-number framework, and going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). greylock also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe greylock's key features and usage. 
We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating greylock's applicability across a range of dataset types and fields.</abstract></addata></record>
fulltext fulltext
identifier ISSN: 2331-8422
ispartof ArXiv.org, 2023-12
issn 2331-8422
2331-8422
language eng
recordid cdi_proquest_miscellaneous_3085687048
source Free E- Journals
title greylock: A Python Package for Measuring The Composition of Complex Datasets
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T16%3A22%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=greylock:%20A%20Python%20Package%20for%20Measuring%20The%20Composition%20of%20Complex%20Datasets&rft.jtitle=ArXiv.org&rft.au=Nguyen,%20Phuc&rft.date=2023-12-29&rft.issn=2331-8422&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3085687048%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3085687048&rft_id=info:pmid/&rfr_iscdi=true