RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes
Published in: Proceedings of the VLDB Endowment 2024-08, Vol. 17 (12), p. 4421-4424
Main authors: , , , ,
Format: Article
Language: English
Online access: Full text
Abstract: Large language models (LLMs) have shown great potential in data cleaning, which is a fundamental task in all modern applications. In this demo proposal, we demonstrate that LLMs can indeed assist in data cleaning, e.g., filling in missing values in a data table, through different approaches. For example, cloud-based non-private LLMs, e.g., the OpenAI GPT family or Google Gemini, can assist in cleaning non-private datasets that encompass world-knowledge information (Scenario 1). However, such LLMs may struggle with datasets they have never encountered before, e.g., local enterprise data, or when the user requires an explanation of the source of the suggested clean values. In that case, retrieval-based methods using RAG (Retrieval Augmented Generation), which complement the LLM's power with a user-provided data source, e.g., a data lake, are a must. The data lake is indexed, and each time a new request comes, we retrieve the top-k tuples most relevant to the user's query tuple to be cleaned and leverage LLM inference power to infer the correct value (Scenario 2). Nevertheless, even in Scenario 2, sharing enterprise data with public LLMs (an externally hosted model) might not be feasible for privacy reasons. In this scenario, we showcase the practicality of locally hosted small LLMs in the cleaning process, especially after fine-tuning them on a small number of examples (Scenario 3). Our proposed system, RetClean, seamlessly supports all three scenarios and provides a user-friendly GUI that enables the VLDB audience to explore and experiment with different LLMs and investigate their trade-offs.
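The Scenario 2 pipeline described in the abstract (index the data lake, retrieve the top-k tuples relevant to the dirty tuple, let an LLM infer the missing value) can be sketched roughly as follows. This is a minimal illustration, not RetClean's actual implementation: the overlap-based ranking and the majority-vote stand-in for the LLM inference step are assumptions, and all function names are hypothetical.

```python
from collections import Counter

def tuple_values(t):
    """Serialize a tuple (dict of attribute -> value) into a set of value tokens."""
    return {str(v).lower() for v in t.values() if v is not None}

def retrieve_top_k(query_tuple, data_lake, k=3):
    """Rank data-lake tuples by value overlap with the query tuple.
    A real system would use a proper index (e.g., inverted or vector index)."""
    q = tuple_values(query_tuple)
    scored = sorted(data_lake, key=lambda t: len(q & tuple_values(t)), reverse=True)
    return scored[:k]

def infer_missing_value(query_tuple, attribute, retrieved):
    """Stand-in for the LLM inference step: majority vote over the retrieved
    evidence. RetClean instead prompts an LLM with the tuple and the evidence."""
    candidates = [t[attribute] for t in retrieved if t.get(attribute) is not None]
    return Counter(candidates).most_common(1)[0][0] if candidates else None

# Toy data lake and a tuple with a missing value to clean.
lake = [
    {"city": "Doha", "country": "Qatar"},
    {"city": "Doha", "country": "Qatar"},
    {"city": "Paris", "country": "France"},
]
dirty = {"city": "Doha", "country": None}

evidence = retrieve_top_k(dirty, lake, k=2)
print(infer_missing_value(dirty, "country", evidence))  # -> Qatar
```

The split mirrors the abstract's description: retrieval narrows the data lake down to a few relevant tuples, and only that small evidence set is handed to the (local or hosted) LLM, which is what makes the approach workable for data the model has never seen.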
ISSN: 2150-8097
DOI: 10.14778/3685800.3685890