Less is better: Unweighted data subsampling via influence function

Z Wang, H Zhu, Z Dong, X He, SL Huang - Proceedings of the AAAI …, 2020 - aaai.org
In the time of Big Data, training complex models on large-scale data sets is challenging,
making it appealing to reduce data volume for saving computation resources by …

Optimal subsampling with influence functions

D Ting, E Brochu - Advances in neural information …, 2018 - proceedings.neurips.cc
Subsampling is a common and often effective method to deal with the computational
challenges of large datasets. However, for most statistical models, there is no well-motivated …

Subsampling for partial least-squares regression via an influence function

Z Xie, X Chen - Knowledge-Based Systems, 2022 - Elsevier
Partial least squares (PLS) performs well for high-dimensional regression problems, where
the number of predictors can far exceed the number of observations. Similar to many other …

Leveraging for big data regression

P Ma, X Sun - Wiley Interdisciplinary Reviews: Computational …, 2015 - Wiley Online Library
Rapid advance in science and technology in the past decade brings an extraordinary
amount of data, offering researchers an unprecedented opportunity to tackle complex …

Subsampling and jackknifing: a practically convenient solution for large data analysis with limited computational resources

S Wu, X Zhu, H Wang - arXiv preprint arXiv:2304.06231, 2023 - arxiv.org
Modern statistical analysis often encounters datasets with large sizes. For these datasets,
conventional estimation methods can hardly be used immediately because practitioners …

Sapprox: Enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling

X Zhang, J Wang, J Yin - Proceedings of the VLDB Endowment, 2016 - dl.acm.org
In this paper, we aim to enable both efficient and accurate approximations on arbitrary
sub-datasets of a large dataset. Due to the prohibitive storage overhead of caching offline …

Optimal subsampling approaches for large sample linear regression

R Zhu, P Ma, MW Mahoney, B Yu - arXiv preprint arXiv:1509.05111, 2015 - arxiv.org
A significant hurdle for analyzing large sample data is the lack of effective statistical
computing and inference methods. An emerging powerful approach for analyzing large …

A review on optimal subsampling methods for massive datasets

Y Yao, HY Wang - Journal of Data Science, 2021 - airitilibrary.com
Subsampling is an effective way to deal with big data problems and many subsampling
approaches have been proposed for different models, such as leverage sampling for linear …

SOS: Score-based oversampling for tabular data

J Kim, C Lee, Y Shin, S Park, M Kim, N Park… - Proceedings of the 28th …, 2022 - dl.acm.org
Score-based generative models (SGMs) are a recent breakthrough in generating fake
images. SGMs are known to surpass other generative models, e.g., generative adversarial …

Reparameterizable subset sampling via continuous relaxations

SM Xie, S Ermon - arXiv preprint arXiv:1901.10517, 2019 - arxiv.org
Many machine learning tasks require sampling a subset of items from a collection based on
a parameterized distribution. The Gumbel-softmax trick can be used to sample a single item …