[BOOK][B] Data cleaning

IF Ilyas, X Chu - 2019 - books.google.com
This is an overview of the end-to-end data cleaning process. Data quality is one of the most
important problems in data management, since dirty data often leads to inaccurate data …

Tabular and latent space synthetic data generation: a literature review

J Fonseca, F Bacao - Journal of Big Data, 2023 - Springer
The generation of synthetic data can be used for anonymization, regularization,
oversampling, semi-supervised learning, self-supervised learning, and several other tasks …

HoloDetect: Few-shot learning for error detection

A Heidari, J McGrath, IF Ilyas… - Proceedings of the 2019 …, 2019 - dl.acm.org
We introduce a few-shot learning framework for error detection. We show that data
augmentation (a form of weak supervision) is key to training high-quality, ML-based error …

Machine learning and data cleaning: Which serves the other?

IF Ilyas, T Rekatsinas - ACM Journal of Data and Information Quality …, 2022 - dl.acm.org
The last few years witnessed significant advances in building automated or semi-automated
data quality, data cleaning and data integration systems powered by machine learning (ML) …

Kamino: Constraint-aware differentially private data synthesis

C Ge, S Mohapatra, X He, IF Ilyas - arXiv preprint arXiv:2012.15713, 2020 - arxiv.org
Organizations are increasingly relying on data to support decisions. When data contains
private and sensitive information, the data owner often desires to publish a synthetic …

Computing optimal repairs for functional dependencies

E Livshits, B Kimelfeld, S Roy - ACM Transactions on Database Systems …, 2020 - dl.acm.org
We investigate the complexity of computing an optimal repair of an inconsistent database, in
the case where integrity constraints are Functional Dependencies (FDs). We focus on two …

Database repairs and consistent query answering: Origins and further developments

L Bertossi - Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI …, 2019 - dl.acm.org
In this article we review the main concepts around database repairs and consistent query
answering, with emphasis on tracing back the origin, motivation, and early developments …

A statistical perspective on discovering functional dependencies in noisy data

Y Zhang, Z Guo, T Rekatsinas - Proceedings of the 2020 ACM SIGMOD …, 2020 - dl.acm.org
We study the problem of discovering functional dependencies (FD) from a noisy data set. We
adopt a statistical perspective and draw connections between FD discovery and structure …

PClean: Bayesian data cleaning at scale with domain-specific probabilistic programming

A Lew, M Agrawal, D Sontag… - … conference on artificial …, 2021 - proceedings.mlr.press
Data cleaning is naturally framed as probabilistic inference in a generative model of ground-
truth data and likely errors, but the diversity of real-world error patterns and the hardness of …

The computation of optimal subset repairs

D Miao, Z Cai, J Li, X Gao, X Liu - Proceedings of the VLDB Endowment, 2020 - dl.acm.org
Computing an optimal subset repair of an inconsistent database is becoming a standalone
research problem and has a wide range of applications. However, it has not been well …