Eureka: Human-level reward design via coding large language models

YJ Ma, W Liang, G Wang, DA Huang, O Bastani… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have excelled as high-level semantic planners for
sequential decision-making tasks. However, harnessing them to learn complex low-level …
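
A minimal sketch of the general pattern this abstract points at: candidate reward functions expressed as code (here hard-coded strings standing in for LLM proposals) are each used to drive behavior and then scored against the true task metric, keeping the best. The toy 1-D task, the greedy stand-in for RL training, and the candidate list are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: selecting among reward functions written as code, Eureka-style.
# The candidate strings stand in for LLM proposals; the 1-D "reach +5" task
# and the greedy one-step policy are illustrative placeholders only.
import random


def rollout_score(reward_fn, episodes=50, horizon=30):
    """Roll out a greedy policy that follows the candidate reward on a toy
    1-D 'reach position 5' task; return the true task success rate."""
    successes = 0
    for _ in range(episodes):
        pos = 0.0
        for _ in range(horizon):
            # Greedy one-step policy acting as a cheap stand-in for RL training;
            # ties are broken at random.
            step = max((-1.0, 1.0),
                       key=lambda a: (reward_fn(pos + a), random.random()))
            pos += step
            if pos >= 5.0:
                successes += 1
                break
    return successes / episodes


# Hypothetical stand-ins for LLM-generated reward code.
candidate_sources = [
    "def reward(pos): return 1.0 if pos >= 5.0 else 0.0",  # sparse completion
    "def reward(pos): return -abs(5.0 - pos)",             # distance shaping
    "def reward(pos): return pos",                         # progress shaping
]

best_src, best_score = None, float("-inf")
for src in candidate_sources:
    namespace = {}
    exec(src, namespace)                     # compile the candidate reward
    score = rollout_score(namespace["reward"])
    if score > best_score:
        best_src, best_score = src, score

print("selected candidate:", best_src, "| true task score:", best_score)
```

In this toy, the shaped candidates reach the goal far more reliably than the sparse one, which is the kind of signal a search over generated reward code can exploit.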

Evolutionary reinforcement learning: A survey

H Bai, R Cheng, Y Jin - Intelligent Computing, 2023 - spj.science.org
Reinforcement learning (RL) is a machine learning approach that trains agents to maximize
cumulative rewards through interactions with environments. The integration of RL with deep …
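
One of the integrations such a survey covers can be sketched minimally as an evolution strategy that mutates policy parameters and selects purely on episodic return. The (1+λ) scheme, the toy point-mass task, and all hyperparameters below are illustrative assumptions, not taken from the survey.

```python
"""Minimal (1+lambda) evolution strategy for RL: policy parameters are
mutated and selected on episodic return alone (toy illustration)."""
import numpy as np

rng = np.random.default_rng(0)


def episode_return(theta, horizon=20):
    """Toy task: drive a 2-D point to the origin with a linear controller
    a = -theta * s; reward is negative squared distance, summed over time."""
    s = rng.normal(size=2)
    total = 0.0
    for _ in range(horizon):
        a = -theta * s                      # elementwise linear controller
        s = s + 0.1 * a + 0.01 * rng.normal(size=2)
        total += -float(s @ s)              # cumulative reward to maximize
    return total


theta = np.zeros(2)                          # parent policy parameters
for gen in range(200):
    # lambda = 8 mutated offspring per generation
    offspring = [theta + 0.1 * rng.normal(size=2) for _ in range(8)]
    fitness = [np.mean([episode_return(t) for _ in range(5)]) for t in offspring]
    best = int(np.argmax(fitness))
    if fitness[best] >= np.mean([episode_return(theta) for _ in range(5)]):
        theta = offspring[best]              # elitist replacement
print("evolved policy parameters:", theta)
```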

The perils of trial-and-error reward design: misdesign through overfitting and invalid task specifications

S Booth, WB Knox, J Shah, S Niekum, P Stone… - Proceedings of the …, 2023 - ojs.aaai.org
In reinforcement learning (RL), a reward function that aligns exactly with a task's true
performance metric is often necessarily sparse. For example, a true task metric might …

Auto MC-Reward: Automated dense reward design with large language models for Minecraft

H Li, X Yang, Z Wang, X Zhu, J Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
Many reinforcement learning environments (e.g., Minecraft) provide only sparse rewards that
indicate task completion or failure with binary values. The challenge in exploration efficiency …
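
The sparse-versus-dense distinction in this abstract can be illustrated with a toy contrast (a small grid walk, not the paper's Minecraft tasks): the binary completion reward is silent along the whole trajectory, while a shaped distance-based reward gives a signal at every step.

```python
"""Contrast a sparse task-completion reward with a dense, shaped reward
on a toy grid walk (illustrative only)."""

GOAL = (3, 3)


def sparse_reward(state):
    # Binary signal: 1 only on task completion, 0 everywhere else.
    return 1.0 if state == GOAL else 0.0


def dense_reward(state):
    # Shaped signal: closer to the goal is better, so every step is informative.
    gx, gy = GOAL
    x, y = state
    return -(abs(gx - x) + abs(gy - y))     # negative Manhattan distance


trajectory = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3)]
for s in trajectory:
    print(s, "sparse:", sparse_reward(s), "dense:", dense_reward(s))
```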

Preference transformer: Modeling human preferences using transformers for RL

C Kim, J Park, J Shin, H Lee, P Abbeel… - arXiv preprint arXiv …, 2023 - arxiv.org
Preference-based reinforcement learning (RL) provides a framework to train agents using
human preferences between two behaviors. However, preference-based RL has been …
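
As background for the preference-based setting, a common recipe is to fit a reward model to pairwise segment comparisons with a Bradley-Terry likelihood. The sketch below uses synthetic preferences and a linear per-step reward model rather than the transformer architecture the paper proposes.

```python
"""Fit a tiny reward model to pairwise trajectory-segment preferences with
the Bradley-Terry likelihood used in preference-based RL (toy data only)."""
import numpy as np

rng = np.random.default_rng(1)


def segment(length=10):
    """A trajectory segment as a sequence of 3-D feature vectors."""
    return rng.normal(size=(length, 3))


# Synthetic preferences: segment a is preferred when its hidden "true"
# return (sum of the first feature) is larger.
pairs = []
for _ in range(200):
    a, b = segment(), segment()
    label = 1.0 if a[:, 0].sum() > b[:, 0].sum() else 0.0   # 1 => a preferred
    pairs.append((a, b, label))

w = np.zeros(3)                        # linear per-step reward r(s) = w . s
lr = 0.05
for _ in range(300):
    grad = np.zeros(3)
    for a, b, label in pairs:
        ra, rb = a @ w, b @ w          # per-step rewards under the model
        logit = ra.sum() - rb.sum()    # difference of segment returns
        p_a = 1.0 / (1.0 + np.exp(-logit))   # Bradley-Terry P(a preferred)
        # Gradient of the negative log-likelihood with respect to w
        grad += (p_a - label) * (a.sum(axis=0) - b.sum(axis=0))
    w -= lr * grad / len(pairs)

print("learned reward weights:", w)    # should emphasize the first feature
```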

Evolving reinforcement learning algorithms

JD Co-Reyes, Y Miao, D Peng, E Real, S Levine… - arXiv preprint arXiv …, 2021 - arxiv.org
We propose a method for meta-learning reinforcement learning algorithms by searching
over the space of computational graphs which compute the loss function for a value-based …
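
A heavily simplified version of the search space described here: each candidate is a small expression over standard value-based quantities (current estimate, reward, bootstrapped next value, discount) used as the tabular update error, and candidates are scored by the greedy return they produce on a toy chain task. The primitive set, the chain environment, and the three hand-written candidates stand in for the paper's evolved computational graphs.

```python
"""Score candidate value-update rules on a toy 5-state chain; only the
standard TD error should solve the task (illustrative simplification)."""
import random

N = 5                                   # chain states 0..4, goal at 4


def step(s, a):
    """Actions: 0 = left, 1 = right. Reward 1 only when reaching the goal."""
    s2 = min(N - 1, s + 1) if a == 1 else max(0, s - 1)
    r = 1.0 if s2 == N - 1 else 0.0
    return s2, r, s2 == N - 1


def train_and_score(error_fn, episodes=300, lr=0.5, gamma=0.9):
    Q = [[0.0, 0.0] for _ in range(N)]
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            a = random.randrange(2)     # random behaviour policy (off-policy)
            s2, r, done = step(s, a)
            q_next = 0.0 if done else max(Q[s2])
            Q[s][a] += lr * error_fn(Q[s][a], r, q_next, gamma)  # candidate update
            s = s2
            if done:
                break
    # Score: undiscounted return of the greedy policy from the start state.
    s, total = 0, 0.0
    for _ in range(20):
        s, r, done = step(s, int(Q[s][1] > Q[s][0]))
        total += r
        if done:
            break
    return total


candidates = {
    "td_error":    lambda q, r, qn, g: r + g * qn - q,  # standard Q-learning error
    "myopic":      lambda q, r, qn, g: r - q,           # drops the bootstrap term
    "reward_only": lambda q, r, qn, g: r,               # ignores the value estimate
}
print({name: train_and_score(fn) for name, fn in candidates.items()})
```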

Automated reinforcement learning: An overview

RR Afshar, Y Zhang, J Vanschoren… - arXiv preprint arXiv …, 2022 - arxiv.org
Reinforcement Learning and, more recently, Deep Reinforcement Learning are popular methods
for solving sequential decision-making problems modeled as Markov Decision Processes …
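
For reference, the Markov Decision Process framing mentioned here reduces to a handful of objects (states, actions, transition probabilities, rewards, discount). The two-state MDP and value-iteration loop below are an arbitrary illustration, not drawn from the overview.

```python
"""Value iteration on a tiny 2-state MDP (numbers are arbitrary)."""

# P[s][a] = list of (probability, next_state, reward)
P = {
    0: {0: [(0.9, 0, 0.0), (0.1, 1, 1.0)],    # safe action, small chance of progress
        1: [(0.4, 0, 0.0), (0.6, 1, 1.0)]},   # riskier action, better odds
    1: {0: [(1.0, 1, 2.0)],                   # stay in the rewarding state
        1: [(1.0, 0, 0.0)]},
}
gamma = 0.95

V = {s: 0.0 for s in P}
for _ in range(500):                          # Bellman optimality backups
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values())
         for s in P}

policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print("V* =", V, "greedy policy =", policy)
```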

Sequential preference ranking for efficient reinforcement learning from human feedback

M Hwang, G Lee, H Kee, CW Kim… - Advances in Neural …, 2024 - proceedings.neurips.cc
Reinforcement learning from human feedback (RLHF) alleviates the problem of designing a
task-specific reward function in reinforcement learning by learning it from human preference …

TTOpt: A maximum volume quantized tensor train-based optimization and its application to reinforcement learning

K Sozykin, A Chertkov, R Schutski… - Advances in …, 2022 - proceedings.neurips.cc
We present a novel procedure for optimization based on the combination of efficient
quantized tensor train representation and a generalized maximum matrix volume principle …

Open Source Vizier: Distributed infrastructure and API for reliable and flexible blackbox optimization

X Song, S Perel, C Lee, G Kochanski… - International …, 2022 - proceedings.mlr.press
Vizier is the de facto blackbox optimization service across Google, having optimized some of
Google's largest products and research efforts. To operate at the scale of tuning thousands …