Data-centric artificial intelligence: A survey

D Zha, ZP Bhat, KH Lai, F Yang, Z Jiang… - ACM Computing …, 2023 - dl.acm.org
Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler
of its great success is the availability of abundant and high-quality data for building machine …

Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation

M Li, T Shi, C Ziems, MY Kan, NF Chen, Z Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Annotated data plays a critical role in Natural Language Processing (NLP) in training
models and evaluating their performance. Given recent developments in Large Language …
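
The title describes an uncertainty-guided split of annotation work between humans and LLMs. Below is a minimal, hypothetical Python sketch of that idea: instances where repeated LLM labels disagree (high entropy) are routed to human annotators, the rest are kept as LLM labels. The `query_llm_label` helper, the number of samples, and the threshold are illustrative assumptions, not the paper's actual implementation.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple

def label_entropy(labels: List[str]) -> float:
    """Shannon entropy of the empirical label distribution over repeated LLM samples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def allocate(
    texts: List[str],
    query_llm_label: Callable[[str], str],  # assumed helper: one sampled LLM label per call
    n_samples: int = 5,
    threshold: float = 0.8,
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split texts into (LLM-labeled pairs, texts routed to human annotators)."""
    llm_labeled, to_human = [], []
    for text in texts:
        samples = [query_llm_label(text) for _ in range(n_samples)]
        if label_entropy(samples) <= threshold:
            # Low disagreement: keep the LLM's majority label.
            llm_labeled.append((text, Counter(samples).most_common(1)[0][0]))
        else:
            # High disagreement: defer to a human annotator.
            to_human.append(text)
    return llm_labeled, to_human
```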

PRBoost: Prompt-based rule discovery and boosting for interactive weakly-supervised learning

R Zhang, Y Yu, P Shetty, L Song, C Zhang - arXiv preprint arXiv …, 2022 - arxiv.org
Weakly-supervised learning (WSL) has shown promising results in addressing label scarcity
on many NLP tasks, but manually designing a comprehensive, high-quality labeling rule set …

Language models in the loop: Incorporating prompting into weak supervision

R Smith, JA Fries, B Hancock, SH Bach - ACM/IMS Journal of Data …, 2024 - dl.acm.org
We propose a new strategy for applying large pre-trained language models to novel tasks
when labeled training data is limited. Rather than apply the model in a typical zero-shot or …
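
The title and snippet point to using prompted language models as sources inside a weak-supervision pipeline rather than as direct zero-shot predictors. The sketch below is a hypothetical illustration of that pattern: each prompt template is wrapped as a labeling function that may abstain, and the votes are aggregated (here by simple majority vote as a stand-in for a learned label model). The `ask_llm` helper and the yes/no parsing are assumptions for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None

def make_prompt_lf(template: str, ask_llm: Callable[[str], str]) -> Callable[[str], Optional[str]]:
    """Wrap one prompt template as a labeling function that may abstain."""
    def lf(text: str) -> Optional[str]:
        answer = ask_llm(template.format(text=text)).strip().lower()
        if answer.startswith("yes"):
            return "positive"
        if answer.startswith("no"):
            return "negative"
        return ABSTAIN  # unclear answer: abstain rather than guess
    return lf

def weak_label(text: str, lfs: List[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Aggregate labeling-function votes; a learned label model could replace this vote."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN
```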

Improved active multi-task representation learning via lasso

Y Wang, Y Chen, K Jamieson… - … Conference on Machine …, 2023 - proceedings.mlr.press
To leverage the copious amount of data from source tasks and overcome the scarcity of the
target task samples, representation learning based on multi-task pretraining has become a …

Cold-start data selection for better few-shot language model fine-tuning: A prompt-based uncertainty propagation approach

Y Yu, R Zhang, R Xu, J Zhang, J Shen… - Proceedings of the 61st …, 2023 - aclanthology.org
We present PATRON, a prompt-based data selection method for pre-trained language
model fine-tuning under cold-start scenarios, i.e., no initial labeled data are available. In …
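
PATRON's snippet names prompt-based uncertainty as the selection signal for cold-start annotation. The following is a minimal, hypothetical sketch of that core step: score each unlabeled example by the entropy of a prompted model's label-word distribution and pick the top-k for labeling. The `prompt_label_probs` helper and k are assumptions; the paper's full method additionally propagates uncertainty across neighboring examples and enforces diversity, which is omitted here.

```python
import math
from typing import Callable, Dict, List

def predictive_entropy(label_probs: Dict[str, float]) -> float:
    """Entropy of the prompted model's distribution over label words for one example."""
    return -sum(p * math.log2(p) for p in label_probs.values() if p > 0)

def select_for_annotation(
    texts: List[str],
    prompt_label_probs: Callable[[str], Dict[str, float]],  # assumed helper
    k: int = 32,
) -> List[str]:
    """Return the k unlabeled examples with the highest prompt-based uncertainty."""
    scored = [(predictive_entropy(prompt_label_probs(t)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]
```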

Characterizing the Impacts of Semi-supervised Learning for Weak Supervision

J Li, J Zhang, L Schmidt… - Advances in Neural …, 2024 - proceedings.neurips.cc
Labeling training data is a critical and expensive step in producing high-accuracy ML
models, whether training from scratch or fine-tuning. To make labeling more efficient, two …

VideoPro: A Visual Analytics Approach for Interactive Video Programming

J He, X Wang, KK Wong, X Huang… - … on Visualization and …, 2023 - ieeexplore.ieee.org
Constructing supervised machine learning models for real-world video analysis requires
substantial labeled data, which is costly to acquire due to scarce domain expertise and …

Automatic calibration and error correction for large language models via pareto optimal self-supervision

T Zhao, M Wei, JS Preston, H Poon - arXiv preprint arXiv:2306.16564, 2023 - arxiv.org
Large language models (LLMs) have demonstrated remarkable capabilities out of the box for a
wide range of applications, yet accuracy still remains a major growth area, especially in …

Cold-start data selection for few-shot language model fine-tuning: A prompt-based uncertainty propagation approach

Y Yu, R Zhang, R Xu, J Zhang, J Shen… - arXiv preprint arXiv …, 2022 - arxiv.org
Large Language Models have demonstrated remarkable few-shot performance, but the
performance can be sensitive to the selection of few-shot instances. We propose PATRON, a …