Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Provable guarantees for neural networks via gradient feature learning

Z Shi, J Wei, Y Liang - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Neural networks have achieved remarkable empirical performance, while the current
theoretical analysis is not adequate for understanding their success, e.g., the Neural Tangent …

Learning a neuron by a shallow relu network: Dynamics and implicit bias for correlated inputs

D Chistikov, M Englert, R Lazic - Advances in Neural …, 2023 - proceedings.neurips.cc
We prove that, for the fundamental regression task of learning a single neuron, training a
one-hidden-layer ReLU network of any width by gradient flow from a small initialisation …
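A minimal sketch of this student-teacher setup, assuming a unit-norm teacher neuron and plain gradient descent with a small step size as a stand-in for gradient flow (all dimensions and hyperparameters here are illustrative, not taken from the paper):

```python
import torch

torch.manual_seed(0)
d, width, n = 20, 50, 512

# Teacher: a single ReLU neuron y = relu(<w*, x>) with unit-norm w*.
w_star = torch.randn(d)
w_star /= w_star.norm()
X = torch.randn(n, d)
y = torch.relu(X @ w_star)

# Student: one-hidden-layer ReLU network with small initialisation.
init_scale = 1e-3  # illustrative; the theory concerns the small-init regime
W = (init_scale * torch.randn(width, d)).requires_grad_()
a = (init_scale * torch.randn(width)).requires_grad_()

opt = torch.optim.SGD([W, a], lr=1e-2)  # small-lr GD as a proxy for gradient flow
for step in range(5001):
    opt.zero_grad()
    pred = torch.relu(X @ W.T) @ a
    loss = 0.5 * ((pred - y) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step}: loss {loss.item():.6f}")
```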

Adversarial Robustness of In-Context Learning in Transformers for Linear Regression

U Anwar, J Von Oswald, L Kirsch, D Krueger… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have demonstrated remarkable in-context learning capabilities across various
domains, including statistical learning tasks. While previous work has shown that …

Trained transformer classifiers generalize and exhibit benign overfitting in-context

S Frei, G Vardi - arXiv preprint arXiv:2410.01774, 2024 - arxiv.org
Transformers have the capacity to act as supervised learning algorithms: by properly
encoding a set of labeled training ("in-context") examples and an unlabeled test example …
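As a rough illustration of such an encoding (one common convention from the in-context-learning literature, not necessarily the exact scheme used in this paper), labeled pairs and the unlabeled query can be stacked into a single token sequence:

```python
import torch

def build_context(xs, ys, x_query):
    """Stack labeled (x, y) pairs plus an unlabeled query into one token
    sequence of shape (2n + 1, d + 1), interleaving x-tokens and y-tokens."""
    n, d = xs.shape
    tokens = torch.zeros(2 * n + 1, d + 1)
    tokens[0:2 * n:2, :d] = xs   # even positions: inputs
    tokens[1:2 * n:2, d] = ys    # odd positions: labels in the last slot
    tokens[-1, :d] = x_query     # final token: query, label slot left empty
    return tokens

xs = torch.randn(8, 4)                   # 8 in-context examples in R^4
ys = torch.sign(xs @ torch.randn(4))     # labels from a random linear rule
context = build_context(xs, ys, torch.randn(4))
print(context.shape)                     # torch.Size([17, 5])
```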

Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data

B Li, Y Li - arXiv preprint arXiv:2410.08503, 2024 - arxiv.org
Adversarial training is a widely applied approach to training deep neural networks to be
robust against adversarial perturbation. However, although adversarial training has …
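For context, a bare-bones version of the standard adversarial training loop, with a PGD-style inner maximization; this is the generic recipe, not the structured-data setting analyzed in the paper:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.01, steps=10):
    """Generate L-infinity-bounded adversarial examples by projected
    gradient ascent on the cross-entropy loss."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient ascent step
            delta.clamp_(-eps, eps)             # project back into the eps-ball
        delta.grad.zero_()
    return (x + delta).detach()

def adversarial_training_step(model, opt, x, y):
    """One outer step: train on the worst-case perturbed batch."""
    x_adv = pgd_attack(model, x, y)
    opt.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    opt.step()
    return loss.item()
```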

Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks

B Li, Z Pan, K Lyu, J Li - arXiv preprint arXiv:2410.10322, 2024 - arxiv.org
In this work, we investigate a particular implicit bias in the gradient descent training process,
which we term "Feature Averaging", and argue that it is one of the principal factors …

Optimization dependent generalization bound for ReLU networks based on sensitivity in the tangent bundle

D Rácz, M Petreczky, A Csertán, B Daróczy - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in deep learning have given us some very promising results on the
generalization ability of deep neural networks; however, the literature still lacks a comprehensive …

Can Implicit Bias Imply Adversarial Robustness?

H Min, R Vidal - arXiv preprint arXiv:2405.15942, 2024 - arxiv.org
The implicit bias of gradient-based training algorithms has been considered mostly
beneficial as it leads to trained networks that often generalize well. However, Frei et …

MALT Powers Up Adversarial Attacks

O Melamed, G Yehudai, A Shamir - arXiv preprint arXiv:2407.02240, 2024 - arxiv.org
Current adversarial attacks for multi-class classifiers choose the target class for a given input
naively, based on the classifier's confidence levels for various target classes. We present a …
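The naive scheme being criticized can be sketched as follows: rank candidate target classes by the model's confidence and aim a targeted attack at the top-ranked wrong class (illustrative code, not the MALT algorithm itself):

```python
import torch
import torch.nn.functional as F

def naive_target_class(model, x, true_label):
    """Pick the attack target as the highest-confidence incorrect class."""
    with torch.no_grad():
        logits = model(x.unsqueeze(0)).squeeze(0)
    logits[true_label] = float("-inf")  # exclude the true class
    return int(logits.argmax())

def targeted_fgsm(model, x, target, eps=0.03):
    """One targeted FGSM step: move x to *decrease* the loss on `target`."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x.unsqueeze(0)), torch.tensor([target]))
    loss.backward()
    return (x - eps * x.grad.sign()).detach()
```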