many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to
pursue goals that conflict (i.e., are misaligned) with human interests. If trained like today's
most capable models, AGIs could learn to act deceptively to receive higher reward, learn
internally-represented goals which generalize beyond their fine-tuning distributions, and
pursue those goals using power-seeking strategies. We review emerging evidence for these …