Can you rely on your model evaluation? Improving model evaluation with synthetic test data

B van Breugel, N Seedat, F Imrie… - Advances in Neural …, 2024 - proceedings.neurips.cc
Evaluating the performance of machine learning models on diverse and underrepresented
subgroups is essential for ensuring fairness and reliability in real-world applications …
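
The snippet's premise is that per-subgroup metrics computed on a handful of real test points are too noisy to trust, and synthetic data can densify those slices. A minimal sketch of the evaluation side, using a per-class Gaussian as a deliberately crude stand-in for the conditional generative model such a pipeline would actually use (all names here are illustrative, not the paper's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def per_group_accuracy(predict, X, y, g):
    """Per-subgroup accuracy; tiny groups give high-variance estimates."""
    return {k: float((predict(X[g == k]) == y[g == k]).mean())
            for k in np.unique(g)}

def augment_group(X, y, g, group, n_extra):
    """Stand-in 'generator': per-class Gaussians fit to one subgroup.
    A real pipeline would draw from a conditional generative model."""
    Xg, yg = X[g == group], y[g == group]
    X_new, y_new = [], []
    for c in np.unique(yg):
        Xc = Xg[yg == c]
        mu, sd = Xc.mean(axis=0), Xc.std(axis=0) + 1e-6
        X_new.append(rng.normal(mu, sd, size=(n_extra, X.shape[1])))
        y_new.append(np.full(n_extra, c))
    return np.vstack(X_new), np.concatenate(y_new)
```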

Representation matters: Assessing the importance of subgroup allocations in training data

E Rolf, TT Worledge, B Recht… - … Conference on Machine …, 2021 - proceedings.mlr.press
Collecting more diverse and representative training data is often touted as a remedy for the
disparate performance of machine learning predictors across subpopulations. However, a …

Change is hard: A closer look at subpopulation shift

Y Yang, H Zhang, D Katabi, M Ghassemi - arXiv preprint arXiv:2302.12254, 2023 - arxiv.org
Machine learning models often perform poorly on subgroups that are underrepresented in
the training data. Yet, little is understood about the variation in mechanisms that cause …
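
The benchmark quantity throughout this line of work is worst-group accuracy, i.e. the minimum per-subgroup accuracy rather than the average. A sketch:

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Minimum accuracy over subgroups; a model can score well on
    average while failing badly on this metric."""
    return float(min((y_pred[groups == g] == y_true[groups == g]).mean()
                     for g in np.unique(groups)))
```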

An investigation of why overparameterization exacerbates spurious correlations

S Sagawa, A Raghunathan… - … on Machine Learning, 2020 - proceedings.mlr.press
We study why overparameterization—increasing model size well beyond the point of zero
training error—can hurt test error on minority groups despite improving average test error …

FairBatch: Batch selection for model fairness

Y Roh, K Lee, SE Whang, C Suh - arXiv preprint arXiv:2012.01696, 2020 - arxiv.org
Training a fair machine learning model is essential to prevent demographic disparity.
Existing techniques for improving model fairness require broad changes in either data …
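
FairBatch's central idea is to treat each group's minibatch sampling rate as a tunable parameter and nudge it toward whichever group currently has the higher loss. A simplified sketch of that loop (not the authors' implementation; their update rule targets specific fairness criteria such as equalized odds):

```python
import numpy as np

rng = np.random.default_rng(0)

def fairbatch_step(group_losses, sample_probs, step=0.01):
    """Shift sampling probability toward the worst-off group
    (simplified version of FairBatch's lambda update)."""
    worst = int(np.argmax(group_losses))
    probs = sample_probs.copy()
    probs[worst] += step
    return probs / probs.sum()

def draw_batch(indices_by_group, sample_probs, batch_size):
    """Sample a minibatch with the current per-group rates."""
    counts = rng.multinomial(batch_size, sample_probs)
    return np.concatenate([rng.choice(idx, size=n, replace=True)
                           for idx, n in zip(indices_by_group, counts)
                           if n > 0])
```

Training alternates: draw a batch with the current rates, take a gradient step, measure per-group losses, then call fairbatch_step to re-weight.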

The dataset multiplicity problem: How unreliable data impacts predictions

AP Meyer, A Albarghouthi, L D'Antoni - … of the 2023 ACM Conference on …, 2023 - dl.acm.org
We introduce dataset multiplicity, a way to study how inaccuracies, uncertainty, and social
bias in training datasets impact test-time predictions. The dataset multiplicity framework asks …
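
One way to operationalize the question: retrain under many plausible versions of the training labels and check which test predictions stay stable. A toy sketch with scikit-learn (illustrative only; the paper's framework reasons over the perturbation set far more exhaustively than random sampling):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def prediction_multiplicity(X_tr, y_tr, X_te, flip_rate=0.05, n_worlds=50):
    """Fraction of test points whose prediction is unanimous across
    models trained on plausibly-noisy relabelings of the data."""
    preds = []
    for _ in range(n_worlds):
        flip = rng.random(len(y_tr)) < flip_rate
        y_noisy = np.where(flip, 1 - y_tr, y_tr)  # assumes 0/1 labels
        clf = LogisticRegression().fit(X_tr, y_noisy)
        preds.append(clf.predict(X_te))
    preds = np.stack(preds)
    return float((preds == preds[0]).all(axis=0).mean())
```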

De-biasing “bias” measurement

K Lum, Y Zhang, A Bower - Proceedings of the 2022 ACM Conference …, 2022 - dl.acm.org
When a model's performance differs across socially or culturally relevant groups, like race,
gender, or the intersections of many such groups, it is often called "biased." While much of …
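
The snippet points at a measurement problem: with small groups, an observed performance gap mixes true disparity with estimation noise. A quick way to gauge how much of a measured gap is noise is a bootstrap interval (illustrative only; the paper develops a more principled correction for measurement error):

```python
import numpy as np

rng = np.random.default_rng(0)

def gap_bootstrap_ci(correct, groups, g1, g2, n_boot=2000, alpha=0.05):
    """Bootstrap CI for the accuracy gap between two groups, given a
    0/1 per-example correctness array; a wide interval means the
    measured 'bias' may be mostly noise."""
    i1, i2 = np.where(groups == g1)[0], np.where(groups == g2)[0]
    gaps = [correct[rng.choice(i1, len(i1))].mean()
            - correct[rng.choice(i2, len(i2))].mean()
            for _ in range(n_boot)]
    return np.quantile(gaps, [alpha / 2, 1 - alpha / 2])
```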

Evaluating model performance under worst-case subpopulations

M Li, H Namkoong, S Xia - Advances in Neural Information …, 2021 - proceedings.neurips.cc
The performance of ML models degrades when the training population is different from that
seen under operation. Towards assessing distributional robustness, we study the worst-case …
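
For subpopulations comprising at least an α-fraction of the data, the worst case has a convenient form: the average of the worst α-fraction of per-example losses (the CVaR of the loss distribution). A sketch of that plug-in estimate:

```python
import numpy as np

def worst_case_subpop_risk(losses, alpha):
    """Average loss over the worst-performing alpha-fraction of
    examples: the worst case over subpopulations of mass >= alpha."""
    losses = np.sort(np.asarray(losses))[::-1]  # worst first
    k = max(1, int(np.ceil(alpha * len(losses))))
    return float(losses[:k].mean())
```

For example, worst_case_subpop_risk(losses, 0.1) asks: how badly does the model do on its hardest 10% slice, whatever that slice turns out to be?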

Adaptive risk minimization: A meta-learning approach for tackling group shift

M Zhang, H Marklund, A Gupta… - arXiv preprint arXiv …, 2020 - marwandebbiche.github.io
A fundamental assumption of most machine learning algorithms is that the training and test
data are drawn from the same underlying distribution. However, this assumption is violated …
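
The concrete mechanism ARM meta-trains for can be as simple as recomputing normalization statistics from the unlabeled test batch before predicting, so that related test points inform each other. A sketch of that adaptation step on pre-extracted features (the full method wraps this inside a meta-learning loop):

```python
import numpy as np

def adapt_then_predict(W, b, X_batch, eps=1e-6):
    """Normalize the test batch with its *own* statistics, then apply
    a linear classifier: the batch-norm style test-time adaptation
    that ARM meta-trains the model to exploit."""
    mu = X_batch.mean(axis=0)
    sd = X_batch.std(axis=0) + eps
    Z = (X_batch - mu) / sd
    return (Z @ W + b).argmax(axis=1)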

The evolution of out-of-distribution robustness throughout fine-tuning

A Andreassen, Y Bahri, B Neyshabur… - arXiv preprint arXiv …, 2021 - arxiv.org
Although machine learning models typically experience a drop in performance on out-of-
distribution data, accuracies on in- versus out-of-distribution data are widely observed to …
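
The widely observed pattern is that OOD accuracy tracks ID accuracy along a roughly linear trend across models; "effective robustness" is a model's deviation above that line, and the paper tracks how it evolves (and typically shrinks) during fine-tuning. A sketch of the measurement (the literature often fits the trend on logit-transformed accuracies; omitted here for brevity):

```python
import numpy as np

def effective_robustness(id_acc, ood_acc):
    """Residual of each model's OOD accuracy above the linear ID->OOD
    trend fit across all models; positive = effectively robust."""
    id_acc, ood_acc = np.asarray(id_acc), np.asarray(ood_acc)
    slope, intercept = np.polyfit(id_acc, ood_acc, deg=1)
    return ood_acc - (slope * id_acc + intercept)
```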