Tabular data augmentation for machine learning: Progress and prospects of embracing generative AI

L Cui, H Li, K Chen, L Shou, G Chen - arXiv preprint arXiv:2407.21523, 2024 - arxiv.org
Machine learning (ML) on tabular data is ubiquitous, yet obtaining abundant high-quality
tabular data for model training remains a significant obstacle. Numerous works have …

Table discovery in data lakes: State-of-the-art and future directions

G Fan, J Wang, Y Li, RJ Miller - … of the 2023 International Conference on …, 2023 - dl.acm.org
Data discovery refers to a set of tasks that enable users and downstream applications to
explore and gain insights from massive collections of data sources such as data lakes. In …

CHORUS: foundation models for unified data discovery and exploration

M Kayali, A Lykov, I Fountalis, N Vasiloglou… - arXiv preprint arXiv …, 2023 - arxiv.org
We explore the application of foundation models to data discovery and exploration tasks.
Foundation models are large language models (LLMs) that show promising performance on …

Data lake architecture for storing and transforming web server access log files

E Zagan, M Danubianu - IEEE Access, 2023 - ieeexplore.ieee.org
Web server access log files are text files containing important data about server activities,
client requests addressed to a server, server responses, etc. Large-scale analysis of these …

Fainder: A fast and accurate index for distribution-aware dataset search

L Behme, S Galhotra, K Beedkar, V Markl - Proceedings of the VLDB …, 2024 - dl.acm.org
Efficient data discovery is crucial in the era of data-driven decision-making. However, current
practices face significant challenges due to the intricacies of identifying datasets with …

The History, Present, and Future of ETL Technology

A Simitsis, S Skiadopoulos, P Vassiliadis - DOLAP, 2023 - cs.uoi.gr
There is an abundance of data, but a large volume of it is unusable. Data may be noisy,
unstructured, stored in incompatible for direct analysis medium or format, and often …

Industrial data space application framework for semiconductor wafer manufacturing system scheduling

D Chen, J Zhang, L Wu, P Zhang, M Wang - Journal of Manufacturing …, 2024 - Elsevier
The complex, large-scale semiconductor wafer manufacturing generates substantial diverse
data, creating management hurdles and making efficient use of historical scheduling data …

Knowledge engineering in the era of artificial intelligence

K Hose - European Conference on Advances in Databases and …, 2023 - Springer
Knowledge engineering with respect to knowledge graphs and graph data in
general is becoming a more and more essential component of intelligent systems. Such …

A multi-start simulated annealing strategy for Data Lake Organization Problem

D Fernandes, GS Ramos, RGS Pinheiro… - Applied Soft …, 2024 - Elsevier
The Data Lake Organization Problem consists of optimized data navigation
structures generation to reduce the user's time exploring all available data. The goal is to …

MYCROFT: Towards Effective and Efficient External Data Augmentation

Z Sarwar, V Tran, AN Bhagoji, N Feamster… - arXiv preprint arXiv …, 2024 - arxiv.org
Machine learning (ML) models often require large amounts of data to perform well. When the
available data is limited, model trainers may need to acquire more data from external …