Data collection and quality challenges in deep learning: A data-centric AI perspective

SE Whang, Y Roh, H Song, JG Lee - The VLDB Journal, 2023 - Springer
Data-centric AI is at the center of a fundamental shift in software engineering where machine
learning becomes the new software, powered by big data and computing infrastructure …

The effects of data quality on machine learning performance

L Budach, M Feuerpfeil, N Ihde, A Nathansen… - arXiv preprint arXiv …, 2022 - arxiv.org
Modern artificial intelligence (AI) applications require large quantities of training and test
data. This need creates critical challenges not only concerning the availability of such data …

Automated data processing and feature engineering for deep learning and big data applications: a survey

A Mumuni, F Mumuni - Journal of Information and Intelligence, 2024 - Elsevier
Modern approach to artificial intelligence (AI) aims to design algorithms that learn directly
from data. This approach has achieved impressive results and has contributed significantly …

Opportunities and Challenges in Data-Centric AI

S Kumar, S Datta, V Singh, SK Singh, R Sharma - IEEE Access, 2024 - ieeexplore.ieee.org
Artificial intelligence (AI) systems are trained to solve complex problems and learn to
perform specific tasks by using large volumes of data, such as prediction, classification …

Selective data acquisition in the wild for model charging

C Chai, J Liu, N Tang, G Li, Y Luo - Proceedings of the VLDB …, 2022 - dl.acm.org
The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world
supervised machine learning (ML) tasks. In this paper, we study a new problem, namely …

Construction of knowledge graphs: State and challenges

M Hofer, D Obraczka, A Saeedi, H Köpcke… - arXiv preprint arXiv …, 2023 - arxiv.org
With knowledge graphs (KGs) at the center of numerous applications such as recommender
systems and question answering, the need for generalized pipelines to construct and …

GoodCore: Data-effective and data-efficient machine learning through coreset selection over incomplete data

C Chai, J Liu, N Tang, J Fan, D Miao, J Wang… - Proceedings of the …, 2023 - dl.acm.org
Given a dataset with incomplete data (e.g., missing values), training a machine learning
model over the incomplete data requires two steps. First, it requires a data-effective step that …

Automated data cleaning can hurt fairness in machine learning-based decision making

S Guha, FA Khan, J Stoyanovich… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In this paper, we interrogate whether data quality issues track demographic group
membership (based on sex, race and age) and whether automated data cleaning—of the …

Data Cleaning and AutoML: Would an optimizer choose to clean?

F Neutatz, B Chen, Y Alkhatib, J Ye, Z Abedjan - Datenbank-Spektrum, 2022 - Springer
Data cleaning is widely acknowledged as an important yet tedious task when dealing with
large amounts of data. Thus, there is always a cost-benefit trade-off to consider. In particular …

Data-centric machine learning for geospatial remote sensing data

R Roscher, M Rußwurm, C Gevaert… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent developments and research in modern machine learning have led to substantial
improvements in the geospatial field. Although numerous deep learning models have been …