A survey on dataset quality in machine learning

With the rise of big data, the quality of datasets has become a crucial factor affecting the performance of machine learning models. High-quality datasets are essential for the realization of data value. This survey article summarizes the research direction of dataset quality in machine learning, in...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Information and software technology 2023-10, Vol.162, p.107268, Article 107268
Hauptverfasser: Gong, Youdi, Liu, Guangzhen, Xue, Yunzhi, Li, Rui, Meng, Lingzhong
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:With the rise of big data, the quality of datasets has become a crucial factor affecting the performance of machine learning models. High-quality datasets are essential for the realization of data value. This survey article summarizes the research direction of dataset quality in machine learning, including the definition of related concepts, analysis of quality issues and risks, and a review of dataset quality dimensions and metrics throughout the dataset lifecycle and a review of dataset quality metrics analyzed from a dataset lifecycle perspective and summarized in literatures. Furthermore, this article introduces a comprehensive quality evaluation process, which includes a framework for dataset quality evaluation with dimensions and metrics, computation methods for quality metrics, and assessment models. These studies provide valuable guidance for evaluating dataset quality in the field of machine learning, which can help improve the accuracy, efficiency, and generalization ability of machine learning models, and promote the development and application of artificial intelligence technology.
ISSN:0950-5849
1873-6025
DOI:10.1016/j.infsof.2023.107268