VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation

A Agarwal, Y Jin, T Zhang - The Thirty Sixth Annual …, 2023 - proceedings.mlr.press
We study time-inhomogeneous episodic reinforcement learning (RL) under general function
approximation and sparse rewards. We design a new algorithm, Variance-weighted …

Computationally efficient horizon-free reinforcement learning for linear mixture MDPs

D Zhou, Q Gu - Advances in neural information processing …, 2022 - proceedings.neurips.cc
Recent studies have shown that episodic reinforcement learning (RL) is not more difficult
than bandits, even with a long planning horizon and unknown state transitions. However …

Optimal multi-distribution learning

Z Zhang, W Zhan, Y Chen, SS Du… - The Thirty Seventh …, 2024 - proceedings.mlr.press
Multi-distribution learning (MDL), which seeks to learn a shared model that
minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a …

Settling the sample complexity of online reinforcement learning

Z Zhang, Y Chen, JD Lee… - The Thirty Seventh Annual …, 2024 - proceedings.mlr.press
A central issue lying at the heart of online reinforcement learning (RL) is data efficiency.
While a number of recent works achieved asymptotically minimal regret in online RL, the …

Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency

H Zhao, J He, D Zhou, T Zhang… - The Thirty Sixth Annual …, 2023 - proceedings.mlr.press
Recently, several studies \citep{zhou2021nearly, zhang2021variance, kim2021improved,
zhou2022computationally} have provided variance-dependent regret bounds for linear …

GEC: A unified framework for interactive decision making in MDP, POMDP, and beyond

H Zhong, W Xiong, S Zheng, L Wang, Z Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
We study sample efficient reinforcement learning (RL) under the general framework of
interactive decision making, which includes Markov decision process (MDP), partially …

Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments

R Zhou, Z Zhang, SS Du - International Conference on …, 2023 - proceedings.mlr.press
We study variance-dependent regret bounds for Markov decision processes (MDPs).
Algorithms with variance-dependent regret guarantees can automatically exploit …

Optimal horizon-free reward-free exploration for linear mixture MDPs

J Zhang, W Zhang, Q Gu - International Conference on …, 2023 - proceedings.mlr.press
We study reward-free reinforcement learning (RL) with linear function approximation, where
the agent works in two phases: (1) in the exploration phase, the agent interacts with the …

Exploring and learning in sparse linear MDPs without computationally intractable oracles

N Golowich, A Moitra, D Rohatgi - Proceedings of the 56th Annual ACM …, 2024 - dl.acm.org
The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner
has access to a known feature map φ(x, a) that maps state-action pairs to d-dimensional …

Is behavior cloning all you need? Understanding horizon in imitation learning

DJ Foster, A Block, D Misra - arXiv preprint arXiv:2407.15007, 2024 - arxiv.org
Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision-making
task by learning from demonstrations, and has been widely applied to robotics …