| Title | Authors | Venue | Cited by | Year |
| --- | --- | --- | --- | --- |
| RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment | H Dong, W Xiong, D Goyal, Y Zhan, W Chow, R Pan, S Diao, J Zhang, ... | TMLR | 205 | 2023 |
| Mitigating the Alignment Tax of RLHF | Y Lin, H Lin, W Xiong, S Diao, J Liu, J Zhang, R Pan, H Wang, W Hu, ... | arXiv preprint arXiv:2309.06256 | 55* | 2023 |
| Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint | W Xiong, H Dong, C Ye, Z Wang, H Zhong, H Ji, N Jiang, T Zhang | ICML 2024 | 52* | 2023 |
| GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond | H Zhong, W Xiong, S Zheng, L Wang, Z Wang, Z Yang, T Zhang | arXiv preprint arXiv:2211.01962 | 52* | 2022 |
| Nearly Minimax Optimal Offline Reinforcement Learning with Linear Function Approximation: Single-Agent MDP and Markov Game | W Xiong, H Zhong, C Shi, C Shen, L Wang, T Zhang | ICLR 2023 | 44 | 2022 |
| LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models | S Diao, R Pan, H Dong, KS Shum, J Zhang, W Xiong, T Zhang | NAACL 2024 (Best Demo Paper Award) | 41 | 2023 |
| Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets | H Zhong, W Xiong, J Tan, L Wang, T Zhang, Z Wang, Z Yang | ICML 2022 | 41 | 2022 |
| Decentralized Multi-Player Multi-Armed Bandits with No Collision Information | C Shi, W Xiong, C Shen, J Yang | AISTATS 2020 | 41 | 2020 |
| Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration | Z Liu, M Lu, W Xiong, H Zhong, H Hu, S Zhang, S Zheng, Z Yang, Z Wang | NeurIPS 2023 | 27* | 2024 |
| A Self-Play Posterior Sampling Algorithm for Zero-Sum Markov Games | W Xiong, H Zhong, C Shi, C Shen, T Zhang | ICML 2022 | 25 | 2022 |
| Heterogeneous Multi-Player Multi-Armed Bandits: Closing the Gap and Generalization | C Shi, W Xiong, C Shen, J Yang | NeurIPS 2021 | 24 | 2021 |
| Distributional Reinforcement Learning for Multi-Dimensional Reward Functions | P Zhang, X Chen, L Zhao, W Xiong, T Qin, TY Liu | NeurIPS 2021 | 20 | 2021 |
| Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes | C Ye, W Xiong, Q Gu, T Zhang | ICML 2023 | 19 | 2022 |
| RLHF Workflow: From Reward Modeling to Online RLHF | H Dong, W Xiong, B Pang, H Wang, H Zhao, Y Zhou, N Jiang, D Sahoo, ... | arXiv preprint arXiv:2405.07863 | 18 | 2024 |
| Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards | H Wang, Y Lin, W Xiong, R Yang, S Diao, S Qiu, H Zhao, T Zhang | ACL 2024 | 16 | 2024 |
| PMGT-VR: A Decentralized Proximal-Gradient Algorithmic Framework with Variance Reduction | H Ye, W Xiong, T Zhang | arXiv preprint arXiv:2012.15010 | 16 | 2020 |
| Online Iterative Reinforcement Learning from Human Feedback with General Preference Model | C Ye, W Xiong, Y Zhang, N Jiang, T Zhang | arXiv preprint arXiv:2402.07314 | 15* | 2024 |
| DPO Meets PPO: Reinforced Token Optimization for RLHF | H Zhong, G Feng, W Xiong, L Zhao, D He, J Bian, L Wang | arXiv preprint arXiv:2404.18922 | 12 | 2024 |
| Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts | H Wang, W Xiong, T Xie, H Zhao, T Zhang | arXiv preprint arXiv:2406.12845 | 6 | 2024 |
| Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization | R Pi, T Han, W Xiong, J Zhang, R Liu, R Pan, T Zhang | ECCV 2024 | 6 | 2024 |