AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Natural language reasoning, a survey

F Yu, H Zhang, P Tiwari, B Wang - ACM Computing Surveys, 2023 - dl.acm.org
This survey paper proposes a clearer view of natural language reasoning in the field of
Natural Language Processing (NLP), both conceptually and practically. Conceptually, we …

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

B Wang, W Chen, H Pei, C Xie, M Kang, C Zhang, C Xu… - NeurIPS, 2023 - blogs.qub.ac.uk
Abstract Generative Pre-trained Transformer (GPT) models have exhibited exciting progress
in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the …

Evaluating the moral beliefs encoded in LLMs

N Scherrer, C Shi, A Feder… - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper presents a case study on the design, administration, post-processing, and
evaluation of surveys on large language models (LLMs). It comprises two components: (1) A …

TrustLLM: Trustworthiness in large language models

L Sun, Y Huang, H Wang, S Wu, Q Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …

Large pre-trained language models contain human-like biases of what is right and wrong to do

P Schramowski, C Turan, N Andersen… - Nature Machine …, 2022 - nature.com
Artificial writing is permeating our lives due to recent advances in large-scale, transformer-
based language models (LMs) such as BERT, GPT-2 and GPT-3. Using them as pre-trained …

When to make exceptions: Exploring language models as accounts of human moral judgment

Z Jin, S Levine, F Gonzalez Adauto… - Advances in neural …, 2022 - proceedings.neurips.cc
AI systems are becoming increasingly intertwined with human life. In order to effectively
collaborate with humans and ensure safety, AI systems need to be able to understand …

Latent hatred: A benchmark for understanding implicit hate speech

M ElSherief, C Ziems, D Muchlinski, V Anupindi… - arXiv preprint arXiv …, 2021 - arxiv.org
Hate speech has grown significantly on social media, causing serious consequences for
victims of all demographics. Despite much attention being paid to characterize and detect …

The importance of modeling social factors of language: Theory and practice

D Hovy, D Yang - Proceedings of the 2021 Conference of the …, 2021 - aclanthology.org
Natural language processing (NLP) applications are now more powerful and ubiquitous
than ever before. With rapidly developing (neural) models and ever-more available data …

NLPositionality: Characterizing design biases of datasets and models

S Santy, JT Liang, RL Bras, K Reinecke… - arXiv preprint arXiv …, 2023 - arxiv.org
Design biases in NLP systems, such as performance differences for different populations,
often stem from their creators' positionality, i.e., views and lived experiences shaped by …