Unmasking Data Secrets: An Empirical Investigation into Data Smells and Their Impact on Data Quality

Bibliographic details
Main author:	Recupito, Gilberto
Format:	Video
Language:	English
Subjects:	Empirical software engineering

Description
Summary:	Artificial Intelligence (AI) is rapidly advancing with a data-centric approach suitable for various domains. Nevertheless, AI faces significant challenges, particularly in data quality. Data collection from diverse sources can introduce quality issues that may threaten the development of AI-enabled systems. A growing concern in this context is the emergence of data smells: issues specific to the data used in building AI models, which can have long-term consequences. This preliminary study investigates data smells and their impact on data quality. First, we updated an existing literature review, highlighting data smells and the tools to detect them. We then evaluated the prevalence of data smells and their correlation with data quality. Our results expand the existing data smell catalog with 12 novel data smells distributed across three additional categories. We find that data smells correlate substantially with data quality, particularly where smells are highly prevalent. This research sheds light on the complex relationship between data smells and data quality, providing insights into the challenges of maintaining AI-enabled systems.
DOI:	10.6084/m9.figshare.24598851