\(\textit{greylock}\): A Python Package for Measuring the Composition of Complex Datasets

Bibliographic Details
Published in: arXiv.org 2023-12
Main authors: Nguyen, Phuc; Arora, Rohit; Hill, Elliot D; Braun, Jasper; Morgan, Alexandra; Quintana, Liza M; Mazzoni, Gabrielle; Lee, Ghee Rye; Arnaout, Rima; Arnaout, Ramy
Format: Article
Language: English
Online access: Full text
Description
Summary: Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed \(\textit{greylock}\), a Python package that calculates diversity measures and is tailored to large datasets. \(\textit{greylock}\) can calculate any of the frequency-sensitive measures of Hill's D-number framework and, going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). \(\textit{greylock}\) also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe \(\textit{greylock}\)'s key features and usage. We end with several examples (immunomics, metagenomics, computational pathology, and medical imaging) illustrating \(\textit{greylock}\)'s applicability across a range of dataset types and fields.
ISSN: 2331-8422
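
The summary refers to frequency-sensitive measures in Hill's D-number framework and to similarity-sensitive counterparts. The following is a minimal, standard-library sketch of what such measures compute. It does not use or reproduce the \(\textit{greylock}\) API. The function names `hill_number` and `similarity_sensitive_diversity` are hypothetical, and the similarity-sensitive form is assumed here to follow the Leinster and Cobbold generalization, which the record itself does not name.

```python
import math


def hill_number(counts, q):
    """Frequency-sensitive diversity (Hill number) of order q.

    counts: non-negative abundances, one per class or element type.
    q: viewpoint parameter; larger q gives more weight to common elements.
    """
    total = sum(counts)
    p = [c / total for c in counts if c > 0]  # relative frequencies
    if q == 1:
        # Limit as q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    if math.isinf(q):
        # Limit as q -> infinity: reciprocal of the largest frequency.
        return 1.0 / max(p)
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


def similarity_sensitive_diversity(counts, similarity, q):
    """Similarity-sensitive diversity of order q (assumed formulation).

    similarity: square matrix (list of lists) with entries in [0, 1] and
    ones on the diagonal; similarity[i][j] is how alike elements i and j are.
    """
    total = sum(counts)
    p = [c / total for c in counts]
    n = len(p)
    # "Ordinariness" of element i: its expected similarity to a random draw.
    zp = [sum(similarity[i][j] * p[j] for j in range(n)) for i in range(n)]
    present = [(p[i], zp[i]) for i in range(n) if p[i] > 0]
    if q == 1:
        return math.exp(-sum(pi * math.log(zi) for pi, zi in present))
    if math.isinf(q):
        return 1.0 / max(zi for _, zi in present)
    return sum(pi * zi ** (q - 1) for pi, zi in present) ** (1.0 / (1.0 - q))
```

With an identity similarity matrix (no element resembles any other), the second function reduces to the first. As off-diagonal similarities grow, the reported diversity falls toward 1, reflecting that a dataset of near-identical elements behaves like a single class.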