… When collecting data in the conservative MDP, we collect h-length truncated trajectories starting from states in the original offline dataset. By collecting data this way, we are able to …
L Shi, G Li, Y Wei, Y Chen… - … on machine learning, 2022 - proceedings.mlr.press
… Offline or batch reinforcement learning seeks to learn a near-optimal policy using history … To counter the insufficient coverage and sample scarcity of many offline datasets, the principle …
Y Wu, G Tucker, O Nachum - arXiv preprint arXiv:1911.11361, 2019 - arxiv.org
… -optimal policies (eg, robotic control and recommendation systems). To simulate this scenario, we collect the offline dataset with a sub-optimal … evaluate offline RL algorithms by training …
… We study the offline safe RL problem from a novel multi-objective optimization perspective … maximum-reward Pareto optimal trajectory with cost less than κ. Then we append the new …
T Zhang, J Guan, L Zhao, Y Li, D Li, Z Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
… trajectory-based preference optimization. We directly generate preferred trajectory data for preference optimization, … We can compare two trajectories based on success or time to …
… , the risk of crashing the production machines when optimizing production processes, or the risk of losing … In Offline RL, we assume that a dataset P of trajectories is provided. A single trajectory …
… , latent offline model-based policy optimization (LOMPO), which enables … on offline RL in high-dimensional POMDPs, where the agent has access to the fixed dataset Denv of trajectories, …
… offline RL setting can make full use of the ability of deep networks to extract the optimal policy from a large amount of offline … discrepancy between the offline training data and the target …
W Zhou, S Bajracharya, D Held - … on Robot Learning, 2021 - proceedings.mlr.press
… [19] samples action sequences from the CVAE when they perform trajectory optimization with the learned latent dynamics model. Krupnik et al. [20] extended the previous method to …