SAITS: Self-attention-based imputation for time series

W Du, D Côté, Y Liu - Expert Systems with Applications, 2023 - Elsevier
Missing data in time series is a pervasive problem that puts obstacles in the way of
advanced analysis. A popular solution is imputation, where the fundamental challenge is to …

NeuroCard: one cardinality estimator for all tables

Z Yang, A Kamsetty, S Luan, E Liang, Y Duan… - arXiv preprint arXiv …, 2020 - arxiv.org
Query optimizers rely on accurate cardinality estimates to produce good execution plans.
Despite decades of research, existing cardinality estimators are inaccurate for complex …

Advances in Biomedical Missing Data Imputation: A Survey

M Barrabés, M Perera, VN Moriano, X Giró-I-Nieto… - IEEE …, 2024 - ieeexplore.ieee.org
Ensuring good data quality in biomedical sciences is crucial for reliable research outcomes,
particularly as precision medicine continues to gain prominence. Missing values …

Multi-directional temporal convolutional artificial neural network for PM2.5 forecasting with missing values: A deep learning approach

KKR Samal, KS Babu, SK Das - Urban Climate, 2021 - Elsevier
Data imputation and forecasting are the major research areas in environmental data
engineering. Solving those critical issues has an immense impact on air pollution …

Deep learning-based phenotype imputation on population-scale biobank data increases genetic discoveries

U An, A Pazokitoroudi, M Alvarez, L Huang, S Bacanu… - Nature Genetics, 2023 - nature.com
Biobanks that collect deep phenotypic and genomic data across many individuals have
emerged as a key resource in human genetics. However, phenotypes in biobanks are often …

Machine learning and data cleaning: Which serves the other?

IF Ilyas, T Rekatsinas - ACM Journal of Data and Information Quality …, 2022 - dl.acm.org
The last few years witnessed significant advances in building automated or semi-automated
data quality, data cleaning and data integration systems powered by machine learning (ML) …

Saga: A platform for continuous construction and serving of knowledge at scale

IF Ilyas, T Rekatsinas, V Konda, J Pound, X Qi… - Proceedings of the …, 2022 - dl.acm.org
We introduce Saga, a next-generation knowledge construction and serving platform for
powering knowledge-based applications at industrial scale. Saga follows a hybrid batch …

GoodCore: Data-effective and data-efficient machine learning through coreset selection over incomplete data

C Chai, J Liu, N Tang, J Fan, D Miao, J Wang… - Proceedings of the …, 2023 - dl.acm.org
Given a dataset with incomplete data (e.g., missing values), training a machine learning
model over the incomplete data requires two steps. First, it requires a data-effective step that …

Jellyfish: Instruction-tuning local large language models for data preprocessing

H Zhang, Y Dong, C Xiao… - Proceedings of the 2024 …, 2024 - aclanthology.org
This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the
data mining pipeline that transforms raw data into a clean format. We instruction-tune local …

Kamino: Constraint-aware differentially private data synthesis

C Ge, S Mohapatra, X He, IF Ilyas - arXiv preprint arXiv:2012.15713, 2020 - arxiv.org
Organizations are increasingly relying on data to support decisions. When data contains
private and sensitive information, the data owner often desires to publish a synthetic …