Dataset discovery and exploration: A survey

NW Paton, J Chen, Z Wu - ACM Computing Surveys, 2023 - dl.acm.org
Data scientists are tasked with obtaining insights from data. However, suitable data is often
not immediately at hand, and there may be many potentially relevant datasets in a data lake …

Data preparation: A technological perspective and review

AAA Fernandes, M Koehler, N Konstantinou… - SN Computer …, 2023 - Springer
Data analysis often uses data sets that were collected for different purposes. Indeed, new
insights are often obtained by combining data sets that were produced independently of …

Selective data acquisition in the wild for model charging

C Chai, J Liu, N Tang, G Li, Y Luo - Proceedings of the VLDB …, 2022 - dl.acm.org
The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world
supervised machine learning (ML) tasks. In this paper, we study a new problem, namely …

Data lakes: A survey of functions and systems

R Hai, C Koutras, C Quix… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Data lakes are becoming increasingly prevalent for Big Data management and data
analytics. In contrast to traditional 'schema-on-write' approaches such as data warehouses …

Table discovery in data lakes: State-of-the-art and future directions

G Fan, J Wang, Y Li, RJ Miller - … of the 2023 International Conference on …, 2023 - dl.acm.org
Data discovery refers to a set of tasks that enable users and downstream applications to
explore and gain insights from massive collections of data sources such as data lakes. In …

DeepJoin: Joinable table discovery with pre-trained language models

Y Dong, C Xiao, T Nozawa, M Enomoto… - arXiv preprint arXiv …, 2022 - arxiv.org
Due to the usefulness in data enrichment for data analysis tasks, joinable table discovery
has become an important operation in data lake management. Existing approaches target …

Responsible data integration: Next-generation challenges

F Nargesian, A Asudeh, HV Jagadish - Proceedings of the 2022 …, 2022 - dl.acm.org
Data integration has been extensively studied by the data management community and is a
core task in the data pre-processing step of ML pipelines. When the integrated data is used …

RONIN: data lake exploration

P Ouellette, A Sciortino, F Nargesian… - Proceedings of the …, 2021 - par.nsf.gov
Dataset discovery can be performed using search (with a query or keywords) to find relevant
data. However, the result of this discovery can be overwhelming to explore. Existing …

METAM: Goal-oriented data discovery

S Galhotra, Y Gong… - 2023 IEEE 39th …, 2023 - ieeexplore.ieee.org
Data is a central component of machine learning and causal inference tasks. The availability
of large amounts of data from sources such as open data repositories, data lakes and data …

A demonstration of KGLac: A data discovery and enrichment platform for data science

A Helal, M Helali, K Ammar, E Mansour - Proceedings of the VLDB …, 2021 - dl.acm.org
Data science's growing success relies on knowing where a relevant dataset exists,
understanding its impact on a specific task, finding ways to enrich a dataset, and leveraging …