Scalable agent alignment via reward modeling: a research direction

J Leike, D Krueger, T Everitt, M Martic, V Maini… - arXiv preprint arXiv …, 2018 - arxiv.org
One obstacle to applying reinforcement learning algorithms to real-world problems is the
lack of suitable reward functions. Designing such reward functions is difficult in part because …

Taxonomy of machine learning safety: A survey and primer

S Mohseni, H Wang, C Xiao, Z Yu, Z Wang… - ACM Computing …, 2022 - dl.acm.org
The open-world deployment of Machine Learning (ML) algorithms in safety-critical
applications such as autonomous vehicles needs to address a variety of ML vulnerabilities …

Certified adversarial robustness via randomized smoothing

J Cohen, E Rosenfeld, Z Kolter - International Conference on …, 2019 - proceedings.mlr.press
We show how to turn any classifier that classifies well under Gaussian noise into a new
classifier that is certifiably robust to adversarial perturbations under the L2 norm. While this …
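
As a concrete illustration of the smoothing construction the snippet describes, here is a minimal Monte Carlo sketch in Python, assuming a hypothetical base classifier `f` that maps a single input to an integer label (names and defaults here are illustrative, not the paper's code):

```python
import numpy as np
from scipy.stats import norm

def smoothed_predict(f, x, sigma, n=1000, rng=None):
    """Monte Carlo prediction for the smoothed classifier
    g(x) = argmax_c P(f(x + noise) = c), noise ~ N(0, sigma^2 I).
    `f` is a hypothetical base classifier: f(x) -> class label (int).
    Returns (top_class, estimated_top_class_probability)."""
    rng = rng or np.random.default_rng()
    votes = np.bincount(
        [f(x + sigma * rng.standard_normal(x.shape)) for _ in range(n)]
    )
    top = int(np.argmax(votes))
    return top, votes[top] / n

def certified_radius(p_lower, sigma):
    """Certified l2 radius when the top-class probability is at least
    p_lower > 1/2: R = sigma * Phi^{-1}(p_lower), i.e. Cohen et al.'s
    certificate with the runner-up probability bounded by 1 - p_lower."""
    return sigma * norm.ppf(p_lower)
```

In the paper, the plug-in probability estimate is replaced by a one-sided binomial confidence lower bound before the radius is computed; the version above is only for illustration.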

Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models

B Wang, C Xu, S Wang, Z Gan, Y Cheng, J Gao… - arXiv preprint arXiv …, 2021 - arxiv.org
Large-scale pre-trained language models have achieved tremendous success across a
wide range of natural language understanding (NLU) tasks, even surpassing human …

Provably robust deep learning via adversarially trained smoothed classifiers

H Salman, J Li, I Razenshteyn… - Advances in neural …, 2019 - proceedings.neurips.cc
Recent works have shown the effectiveness of randomized smoothing as a scalable
technique for building neural network-based classifiers that are provably robust to ℓ2 …
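
For reference, the ℓ2 certificate that this paper and Cohen et al. above both build on (Theorem 1 of Cohen et al.): if, under Gaussian noise of scale σ, the base classifier returns class A with probability at least p_A and every other class with probability at most p_B, the smoothed classifier is constant within the radius

```latex
R = \frac{\sigma}{2}\left( \Phi^{-1}(p_A) - \Phi^{-1}(p_B) \right)
```

where Φ⁻¹ is the inverse of the standard Gaussian CDF.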

Robustness may be at odds with accuracy

D Tsipras, S Santurkar, L Engstrom, A Turner… - arXiv preprint arXiv …, 2018 - arxiv.org
We show that there may exist an inherent tension between the goal of adversarial
robustness and that of standard generalization. Specifically, training robust models may not …
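
Concretely, the "adversarial robustness" goal in the snippet is the standard min-max (robust risk) training objective, written here for an ℓ∞ threat model of radius ε:

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
\left[ \max_{\|\delta\|_{\infty} \le \varepsilon} \mathcal{L}(\theta,\, x + \delta,\, y) \right]
```

The tradeoff the abstract alludes to arises because the inner maximization can make weakly predictive features useless to a robust model, features a standard model would happily exploit for accuracy.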

Certified robustness to adversarial examples with differential privacy

M Lecuyer, V Atlidakis, R Geambasu… - … IEEE Symposium on …, 2019 - ieeexplore.ieee.org
Adversarial examples that fool machine learning models, particularly deep neural networks,
have been a topic of intense research interest, with attacks and defenses being developed …

When does contrastive learning preserve adversarial robustness from pretraining to finetuning?

L Fan, S Liu, PY Chen, G Zhang… - Advances in neural …, 2021 - proceedings.neurips.cc
Contrastive learning (CL) can learn generalizable feature representations and achieve
state-of-the-art performance on downstream tasks by finetuning a linear classifier on top of it …
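
The snippet's "finetuning a linear classifier on top" refers to the standard linear-probe protocol. A minimal sketch, assuming a hypothetical pretrained `encoder` module and a labeled downstream `train_loader`; all names are illustrative, not the paper's code:

```python
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, feat_dim, num_classes,
                 epochs=10, lr=1e-3, device="cpu"):
    """Freeze the (contrastively) pretrained encoder and train only a
    linear classifier on its features. `encoder` and `train_loader`
    are hypothetical placeholders."""
    encoder.eval().to(device)
    for p in encoder.parameters():
        p.requires_grad_(False)          # frozen backbone

    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                z = encoder(x)           # fixed representations
            loss = loss_fn(head(z), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```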

On the effectiveness of interval bound propagation for training verifiably robust models

S Gowal, K Dvijotham, R Stanforth, R Bunel… - arXiv preprint arXiv …, 2018 - arxiv.org
Recent work has shown that it is possible to train deep neural networks that are provably
robust to norm-bounded adversarial perturbations. Most of these methods are based on …
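
Interval bound propagation itself is simple to state: carry an elementwise lower/upper bound for every activation through the network. A minimal numpy sketch for affine and ReLU layers (the composition the paper trains with), using hypothetical weights:

```python
import numpy as np

def ibp_affine(l, u, W, b):
    """Propagate elementwise bounds l <= x <= u through x -> W x + b.
    With center mu = (u + l)/2 and radius r = (u - l)/2, the tightest
    interval is W mu + b +/- |W| r."""
    mu, r = (u + l) / 2, (u - l) / 2
    center = W @ mu + b
    radius = np.abs(W) @ r
    return center - radius, center + radius

def ibp_relu(l, u):
    """ReLU is monotone, so bounds pass through directly."""
    return np.maximum(l, 0), np.maximum(u, 0)

# Example: logit bounds for a 2-layer net under an l_inf ball of radius eps.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)
x, eps = rng.standard_normal(4), 0.1
l, u = ibp_affine(x - eps, x + eps, W1, b1)
l, u = ibp_relu(l, u)
l, u = ibp_affine(l, u, W2, b2)   # bounds on the logits
```

If the lower bound of the true class's logit exceeds the upper bound of every other logit, no perturbation in the ball can change the prediction.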

Semidefinite relaxations for certifying robustness to adversarial examples

A Raghunathan, J Steinhardt… - Advances in neural …, 2018 - proceedings.neurips.cc
Despite their impressive performance on diverse tasks, neural networks fail catastrophically
in the presence of adversarial inputs—imperceptibly but adversarially perturbed versions of …
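
To make "adversarially perturbed versions" concrete, below is the classic fast gradient sign method of Goodfellow et al., a one-step attack and not this paper's semidefinite certification approach:

```python
import torch

def fgsm(model, x, y, eps, loss_fn=torch.nn.functional.cross_entropy):
    """Fast gradient sign method (Goodfellow et al.): a one-step l_inf
    attack, x_adv = x + eps * sign(grad_x L(model(x), y)). Illustrates
    the adversarial inputs the snippet describes."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```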