greylock: A Python Package for Measuring The Composition of Complex Datasets

Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have be...

Bibliographic details
Published in:	ArXiv.org 2023-12
Main authors:	Nguyen, Phuc; Arora, Rohit; Hill, Elliot D; Braun, Jasper; Morgan, Alexandra; Quintana, Liza M; Mazzoni, Gabrielle; Lee, Ghee Rye; Arnaout, Rima; Arnaout, Ramy
Format:	Article
Language:	eng
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	ArXiv.org
container_volume
creator	Nguyen, Phuc; Arora, Rohit; Hill, Elliot D; Braun, Jasper; Morgan, Alexandra; Quintana, Liza M; Mazzoni, Gabrielle; Lee, Ghee Rye; Arnaout, Rima; Arnaout, Ramy
description	Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed greylock, a Python package that calculates diversity measures and is tailored to large datasets. greylock can calculate any of the frequency-sensitive measures of Hill's D-number framework, and going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). greylock also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe greylock's key features and usage. We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating greylock's applicability across a range of dataset types and fields.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_miscellaneous_3085687048</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3085687048</sourcerecordid><originalsourceid>FETCH-proquest_miscellaneous_30856870483</originalsourceid><addsrcrecordid>eNqVirsKwjAUQIMoWLT_kNGlEJM-gptUxUHBoXsJ5fZh06bmpmD_XhEHV6dzDpwZ8bgQ20CGnM9_fEl8xDtjjMcJjyLhkUtlYdKmaHd0T2-Tq01Pb6poVQW0NJZeQeFom76iWQ00Nd1gsHHN-zLlJzU86UE5heBwTRal0gj-lyuyOR2z9BwM1jxGQJd3DRagterBjJgLJqNYJiyU4o_1BSr2QvI</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3085687048</pqid></control><display><type>article</type><title>greylock: A Python Package for Measuring The Composition of Complex Datasets</title><source>Free E- Journals</source><creator>Nguyen, Phuc ; Arora, Rohit ; Hill, Elliot D ; Braun, Jasper ; Morgan, Alexandra ; Quintana, Liza M ; Mazzoni, Gabrielle ; Lee, Ghee Rye ; Arnaout, Rima ; Arnaout, Ramy</creator><creatorcontrib>Nguyen, Phuc ; Arora, Rohit ; Hill, Elliot D ; Braun, Jasper ; Morgan, Alexandra ; Quintana, Liza M ; Mazzoni, Gabrielle ; Lee, Ghee Rye ; Arnaout, Rima ; Arnaout, Ramy</creatorcontrib><description>Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed greylock, a Python package that calculates diversity measures and is tailored to large datasets. 
greylock can calculate any of the frequency-sensitive measures of Hill's D-number framework, and going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). greylock also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe greylock's key features and usage. 
We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating greylock's applicability across a range of dataset types and fields.</description><identifier>ISSN: 2331-8422</identifier><identifier>EISSN: 2331-8422</identifier><language>eng</language><ispartof>ArXiv.org, 2023-12</ispartof><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784</link.rule.ids></links><search><creatorcontrib>Nguyen, Phuc</creatorcontrib><creatorcontrib>Arora, Rohit</creatorcontrib><creatorcontrib>Hill, Elliot D</creatorcontrib><creatorcontrib>Braun, Jasper</creatorcontrib><creatorcontrib>Morgan, Alexandra</creatorcontrib><creatorcontrib>Quintana, Liza M</creatorcontrib><creatorcontrib>Mazzoni, Gabrielle</creatorcontrib><creatorcontrib>Lee, Ghee Rye</creatorcontrib><creatorcontrib>Arnaout, Rima</creatorcontrib><creatorcontrib>Arnaout, Ramy</creatorcontrib><title>greylock: A Python Package for Measuring The Composition of Complex Datasets</title><title>ArXiv.org</title><description>Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed greylock, a Python package that calculates diversity measures and is tailored to large datasets. 
greylock can calculate any of the frequency-sensitive measures of Hill's D-number framework, and going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). greylock also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe greylock's key features and usage. 
We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating greylock's applicability across a range of dataset types and fields.</description><issn>2331-8422</issn><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNqVirsKwjAUQIMoWLT_kNGlEJM-gptUxUHBoXsJ5fZh06bmpmD_XhEHV6dzDpwZ8bgQ20CGnM9_fEl8xDtjjMcJjyLhkUtlYdKmaHd0T2-Tq01Pb6poVQW0NJZeQeFom76iWQ00Nd1gsHHN-zLlJzU86UE5heBwTRal0gj-lyuyOR2z9BwM1jxGQJd3DRagterBjJgLJqNYJiyU4o_1BSr2QvI</recordid><startdate>20231229</startdate><enddate>20231229</enddate><creator>Nguyen, Phuc</creator><creator>Arora, Rohit</creator><creator>Hill, Elliot D</creator><creator>Braun, Jasper</creator><creator>Morgan, Alexandra</creator><creator>Quintana, Liza M</creator><creator>Mazzoni, Gabrielle</creator><creator>Lee, Ghee Rye</creator><creator>Arnaout, Rima</creator><creator>Arnaout, Ramy</creator><scope>7X8</scope></search><sort><creationdate>20231229</creationdate><title>greylock: A Python Package for Measuring The Composition of Complex Datasets</title><author>Nguyen, Phuc ; Arora, Rohit ; Hill, Elliot D ; Braun, Jasper ; Morgan, Alexandra ; Quintana, Liza M ; Mazzoni, Gabrielle ; Lee, Ghee Rye ; Arnaout, Rima ; Arnaout, Ramy</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_miscellaneous_30856870483</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><toplevel>online_resources</toplevel><creatorcontrib>Nguyen, Phuc</creatorcontrib><creatorcontrib>Arora, Rohit</creatorcontrib><creatorcontrib>Hill, Elliot D</creatorcontrib><creatorcontrib>Braun, Jasper</creatorcontrib><creatorcontrib>Morgan, Alexandra</creatorcontrib><creatorcontrib>Quintana, Liza M</creatorcontrib><creatorcontrib>Mazzoni, Gabrielle</creatorcontrib><creatorcontrib>Lee, Ghee 
Rye</creatorcontrib><creatorcontrib>Arnaout, Rima</creatorcontrib><creatorcontrib>Arnaout, Ramy</creatorcontrib><collection>MEDLINE - Academic</collection><jtitle>ArXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Nguyen, Phuc</au><au>Arora, Rohit</au><au>Hill, Elliot D</au><au>Braun, Jasper</au><au>Morgan, Alexandra</au><au>Quintana, Liza M</au><au>Mazzoni, Gabrielle</au><au>Lee, Ghee Rye</au><au>Arnaout, Rima</au><au>Arnaout, Ramy</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>greylock: A Python Package for Measuring The Composition of Complex Datasets</atitle><jtitle>ArXiv.org</jtitle><date>2023-12-29</date><risdate>2023</risdate><issn>2331-8422</issn><eissn>2331-8422</eissn><abstract>Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed greylock, a Python package that calculates diversity measures and is tailored to large datasets. greylock can calculate any of the frequency-sensitive measures of Hill's D-number framework, and going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). greylock also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe greylock's key features and usage. 
We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating greylock's applicability across a range of dataset types and fields.</abstract></addata></record>
fulltext	fulltext
identifier	ISSN: 2331-8422
ispartof	ArXiv.org, 2023-12
issn	2331-8422 2331-8422
language	eng
recordid	cdi_proquest_miscellaneous_3085687048
source	Free E- Journals
title	greylock: A Python Package for Measuring The Composition of Complex Datasets
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T16%3A22%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=greylock:%20A%20Python%20Package%20for%20Measuring%20The%20Composition%20of%20Complex%20Datasets&rft.jtitle=ArXiv.org&rft.au=Nguyen,%20Phuc&rft.date=2023-12-29&rft.issn=2331-8422&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3085687048%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3085687048&rft_id=info:pmid/&rfr_iscdi=true
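
The abstract in this record describes diversity measures that combine elements' frequencies with between-element similarities. As a rough illustration only, the sketch below computes a Hill number of order q with plain NumPy, and assumes the commonly used similarity-sensitive form in which each frequency is weighted by the "ordinariness" (Z p)_i. This is not greylock's API: the function name `hill_number` and its parameters are hypothetical, chosen for this example.

```python
import numpy as np


def hill_number(p, q, Z=None):
    """Diversity of order q for frequencies p (hypothetical helper, not greylock).

    Z is an optional similarity matrix with entries in [0, 1] and ones on the
    diagonal. With Z omitted (identity), this is the ordinary Hill number.
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                      # normalize counts to relative frequencies
    Z = np.eye(len(p)) if Z is None else np.asarray(Z, dtype=float)
    keep = p > 0                         # absent elements do not contribute
    p_kept = p[keep]
    zp = (Z @ p)[keep]                   # similarity-weighted frequency of each element
    if np.isinf(q):
        return float(1.0 / zp.max())     # limit q -> infinity
    if np.isclose(q, 1.0):
        return float(np.exp(-np.sum(p_kept * np.log(zp))))  # limit q -> 1
    return float(np.sum(p_kept * zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))


if __name__ == "__main__":
    freqs = [0.5, 0.25, 0.25]
    for order in (0, 1, 2, np.inf):
        print(order, hill_number(freqs, order))
    # Treating all elements as fully similar collapses them into one effective element.
    print(hill_number(freqs, 1, Z=np.ones((3, 3))))
```

With no similarity matrix, order 0 simply counts the elements present (3 for the frequencies above), and higher orders give progressively more weight to the frequent elements. Passing a matrix of all ones treats every element as identical, so the effective number drops to 1 at every order.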