Tabular data augmentation for machine learning: Progress and prospects of embracing generative AI

L Cui, H Li, K Chen, L Shou, G Chen - arXiv preprint arXiv:2407.21523, 2024 - arxiv.org
Machine learning (ML) on tabular data is ubiquitous, yet obtaining abundant high-quality
tabular data for model training remains a significant obstacle. Numerous works have …

RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph

LL Wei, G Xiao, M Balazinska - arXiv preprint arXiv:2409.14556, 2024 - arxiv.org
As an important component of data exploration and integration, Column Type Annotation
(CTA) aims to label columns of a table with one or more semantic types. With the recent …

LLM-assisted Labeling Function Generation for Semantic Type Detection

C Li, D Zhang, J Wang - arXiv preprint arXiv:2408.16173, 2024 - arxiv.org
Detecting semantic types of columns in data lake tables is an important application. A key
bottleneck in semantic type detection is the availability of human annotation due to the …

ACCIO: Table Understanding Enhanced via Contrastive Learning with Aggregations

W Cho - arXiv preprint arXiv:2411.04443, 2024 - arxiv.org
The attention to table understanding using recent natural language models has been
growing. However, most related works tend to focus on learning the structure of the table …

Investigation of Simple-but-Effective Architecture for Long-form Text Matching with Transformers

C Shen, J Wang - International Conference on Database Systems for …, 2024 - Springer
Long-form text matching plays a significant role in many real-world Natural Language
Processing (NLP) and Information Retrieval (IR) applications. Recently, Transformer-based …

Table Discovery in Data Lakes

G Fan - 2024 - search.proquest.com
Data lakes are massive collections of structured and unstructured datasets. While these
collections consist of various data formats, we focus on tabular data in data lakes. With the …