Data cleaning and machine learning: a systematic literature review

PO Côté, A Nikanjam, N Ahmed, D Humeniuk… - Automated Software …, 2024 - Springer
Machine Learning (ML) is integrated into a growing number of systems for various
applications. Because the performance of an ML model is highly dependent on the quality of …

Sudowoodo: Contrastive self-supervised learning for multi-purpose data integration and preparation

R Wang, Y Li, J Wang - 2023 IEEE 39th International …, 2023 - ieeexplore.ieee.org
Machine learning (ML) is playing an increasingly important role in data management tasks,
particularly in Data Integration and Preparation (DI&P). The success of ML-based …

Data is the new oil–sort of: a view on why this comparison is misleading and its implications for modern data administration

C Stach - Future Internet, 2023 - mdpi.com
Currently, data are often referred to as the oil of the 21st century. This comparison is not only
used to express that the resource data are just as important for the fourth industrial …

Auto-pipeline: synthesizing complex data pipelines by-target using reinforcement learning and search

J Yang, Y He, S Chaudhuri - arXiv preprint arXiv:2106.13861, 2021 - arxiv.org
Recent work has made significant progress in helping users to automate single data
preparation steps, such as string-transformations and table-manipulation operators (e.g. …

Automating and optimizing data-centric what-if analyses on native machine learning pipelines

S Grafberger, P Groth, S Schelter - … of the ACM on Management of Data, 2023 - dl.acm.org
Software systems that learn from data with machine learning (ML) are used in critical
decision-making processes. Unfortunately, real-world experience shows that the pipelines …

Time Series Data Cleaning Under Expressive Constraints on Both Rows and Columns

X Ding, G Li, H Wang, C Wang… - 2024 IEEE 40th …, 2024 - ieeexplore.ieee.org
Time series data generated by thousands of sensors are suffering data quality problems.
Traditional constraint-based techniques have greatly contributed to data cleaning …

DataVinci: Learning Syntactic and Semantic String Repairs

M Singh, J Cambronero, S Gulwani, V Le… - arXiv preprint arXiv …, 2023 - arxiv.org
String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real
Excel spreadsheets from the web were represented as text. Systems that successfully clean …

Enhancing data preparation: insights from a time series case study

C Sancricca, G Siracusa, C Cappiello - Journal of Intelligent Information …, 2024 - Springer
Data play a key role in AI systems that support decision-making processes. Data-centric AI
highlights the importance of having high-quality input data to obtain reliable results …

GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models

M Yan, Y Wang, Y Wang, X Miao, J Li - … of the ACM on Management of …, 2024 - dl.acm.org
Data quality is critical across many applications. The utility of data is undermined by various
errors, making rigorous data cleaning a necessity. Traditional data cleaning systems depend …

Improving Understandability and Control in Data Preparation: A Human-Centered Approach

E Pucci, C Sancricca, S Andolina, C Cappiello… - International Conference …, 2024 - Springer
Data preparation is the process of normalizing, cleaning, transforming, and combining data
prior to processing or analysis. It is crucial for obtaining valuable results from data analysis …