Transfer learning in deep reinforcement learning: A survey

Z Zhu, K Lin, AK Jain, J Zhou - IEEE Transactions on Pattern …, 2023 - ieeexplore.ieee.org
Reinforcement learning is a learning paradigm for solving sequential decision-making
problems. Recent years have witnessed remarkable progress in reinforcement learning …

Pessimistic reward models for off-policy learning in recommendation

O Jeunen, B Goethals - Proceedings of the 15th ACM Conference on …, 2021 - dl.acm.org
Methods for bandit learning from user interactions often require a model of the reward a
certain context-action pair will yield, for example, the probability of a click on a …

Temporal-contextual recommendation in real-time

Y Ma, B Narayanaswamy, H Lin, H Ding - Proceedings of the 26th ACM …, 2020 - dl.acm.org
Personalized real-time recommendation has had a profound impact on retail, media,
entertainment and other industries. However, developing recommender systems for every …

PAC-Bayesian offline contextual bandits with guarantees

O Sakhi, P Alquier, N Chopin - International Conference on …, 2023 - proceedings.mlr.press
This paper introduces a new principled approach for off-policy learning in contextual
bandits. Unlike previous work, our approach does not derive learning principles from …

Pessimistic decision-making for recommender systems

O Jeunen, B Goethals - ACM Transactions on Recommender Systems, 2023 - dl.acm.org
Modern recommender systems are often modelled under the sequential decision-making
paradigm, where the system decides which recommendations to show in order to maximise …

Joint policy-value learning for recommendation

O Jeunen, D Rohde, F Vasile, M Bompaire - Proceedings of the 26th …, 2020 - dl.acm.org
Conventional approaches to recommendation often do not explicitly take into account
information on previously shown recommendations and their recorded responses. One …

Recommendations as treatments

T Joachims, B London, Y Su, A Swaminathan, L Wang - AI Magazine, 2021 - ojs.aaai.org
In recent years, a new line of research has taken an interventional view of recommender
systems, where recommendations are viewed as actions that the system takes to have a …

POTEC: Off-policy learning for large action spaces via two-stage policy decomposition

Y Saito, J Yao, T Joachims - arXiv preprint arXiv:2402.06151, 2024 - arxiv.org
We study off-policy learning (OPL) of contextual bandit policies in large discrete action
spaces where existing methods, most of which rely crucially on reward-regression models …

Ad-load balancing via off-policy learning in a content marketplace

H Sagtani, MG Jhawar, R Mehrotra… - Proceedings of the 17th …, 2024 - dl.acm.org
Ad-load balancing is a critical challenge in online advertising systems, particularly in the
context of social media platforms, where the goal is to maximize user engagement and …

Bayesian counterfactual risk minimization

B London, T Sandler - International Conference on Machine …, 2019 - proceedings.mlr.press
We present a Bayesian view of counterfactual risk minimization (CRM) for offline learning
from logged bandit feedback. Using PAC-Bayesian analysis, we derive a new generalization …