Citation: Sheng YUE, Yongheng DENG, Xingyuan HUA, et al., “Federated Offline Reinforcement Learning with Proximal Policy Evaluation,” Chinese Journal of Electronics, vol. x, no. x, pp. 1–14, xxxx, doi: 10.23919/cje.2023.00.288.

Federated Offline Reinforcement Learning with Proximal Policy Evaluation

doi: 10.23919/cje.2023.00.288
More Information
  • Author Bio:

    Sheng YUE received his B.S. degree in mathematics and his Ph.D. degree in computer science from Central South University, Changsha, China, in 2017 and 2022, respectively. Currently, he is a postdoc with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests include network optimization, distributed learning, and reinforcement learning. (Email: shengyue@tsinghua.edu.cn)

    Yongheng DENG received her B.S. degree in computer science from Nankai University, Tianjin, China, in 2019, and her Ph.D. degree in computer science from Tsinghua University, Beijing, China, in 2024. She is currently a postdoc with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. Her research interests include federated learning, edge intelligence, distributed systems, and mobile/edge computing. (Email: dyh19@mails.tsinghua.edu.cn)

    Xingyuan HUA is pursuing his B.S. degree with the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His research interests include federated learning and reinforcement learning. (Email: xingyuanhua@bit.edu.cn)

    Guanbo WANG received his B.S. degree in software engineering from Tongji University, Shanghai, China. Currently, he is pursuing his Ph.D. degree in computer science with Tsinghua University, Beijing, China. His research interests include federated learning, reinforcement learning, and recommendation systems. (Email: wanggb23@mails.tsinghua.edu.cn)

    Ju REN received his B.S., M.S., and Ph.D. degrees, all in computer science, from Central South University, Changsha, China, in 2009, 2012, and 2016, respectively. Currently, he is an Associate Professor with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests include Internet-of-Things, edge computing, edge intelligence, as well as security and privacy. He currently serves as an Associate Editor for several journals, including IEEE Transactions on Mobile Computing, IEEE Transactions on Cloud Computing, and IEEE Transactions on Vehicular Technology. He also served as the General Co-chair for IEEE BigDataSE’20, the TPC Co-chair for IEEE BigDataSE’19, a Track Co-chair for IEEE ICDCS’24, the Poster Co-chair for IEEE MASS’18, and a Symposium Co-chair for IEEE/CIC ICCC’23 and ’19, I-SPAN’18, and IEEE VTC’17 Fall, among others. He received several best paper awards from IEEE flagship conferences, including IEEE ICC’19 and IEEE HPCC’19, as well as the IEEE TCSC Early Career Researcher Award (2019) and the IEEE ComSoc Asia-Pacific Best Young Researcher Award (2021). He was recognized as a Highly Cited Researcher by Clarivate (2020–2022). (Email: renju@tsinghua.edu.cn)

    Yaoxue ZHANG received the B.S. degree from Northwest Institute of Telecommunication Engineering, Xi’an, China, in 1982, and the Ph.D. degree in computer networking from Tohoku University, Sendai, Japan, in 1989. He is currently a Professor with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His research interests include computer networking, operating systems, and transparent computing. He has published more than 200 papers in peer-reviewed IEEE/ACM journals and conferences. He is the Editor-in-Chief of Chinese Journal of Electronics and a Fellow of the Chinese Academy of Engineering. (Email: zhangyx@tsinghua.edu.cn)

  • Corresponding author: Ju REN (Email: renju@tsinghua.edu.cn)
  • Available Online: 2024-04-13
  • Offline reinforcement learning (RL), which seeks to learn policies from static datasets without active online exploration, has attracted increasing attention in recent years. However, existing offline RL approaches often require a large amount of pre-collected data and hence are difficult for a single agent to realize in practice. Inspired by the advancement of federated learning (FL), this paper studies federated offline reinforcement learning (FORL), whereby multiple agents collaboratively carry out offline policy learning without sharing their raw trajectories. A straightforward solution is to retrofit off-the-shelf offline RL methods for FL; however, such an approach easily overfits individual datasets during local updating, leading to instability and subpar performance. To overcome this challenge, we propose a new FORL algorithm, named MF-FORL, that exploits a novel “proximal local policy evaluation” to judiciously push up action values beyond local data support, enabling agents to capture individual information without forgetting the aggregated knowledge. Further, we introduce a model-based variant, MB-FORL, which improves generalization ability and computational efficiency by utilizing a learned dynamics model. We evaluate the proposed algorithms on a suite of complex and high-dimensional offline RL benchmarks; the results demonstrate significant performance gains over the baselines.
  • 1 It is essential to distinguish federated RL from multi-agent RL [31], [32], as the latter centers on agents’ interactions within a shared environment.
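To make the idea in the abstract more concrete, below is a minimal sketch of one communication round of federated offline policy evaluation under a FedAvg-style server. Each agent fits its critic to its own static transitions while a proximal term keeps the critic's estimates on out-of-support actions close to the last aggregated (global) critic, so local individuality is captured without discarding aggregated knowledge. This is a hedged illustration, not the exact MF-FORL procedure: QNet, local_update, federated_round, prox_coef, the uniform sampling of out-of-support actions, and the (s, a, r, s2, a2) batch layout are all illustrative assumptions.

```python
# Hedged sketch (not the paper's exact algorithm): FedAvg-style federated
# offline policy evaluation with a proximal regularizer toward the global critic.
import copy
import torch
import torch.nn as nn


class QNet(nn.Module):
    """Small critic network Q(s, a) (architecture is an assumption)."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)


def local_update(global_q, batch, prox_coef=1.0, gamma=0.99, steps=50, lr=3e-4):
    """One agent's local policy-evaluation phase (assumed form)."""
    q = copy.deepcopy(global_q)                  # start from the aggregated critic
    opt = torch.optim.Adam(q.parameters(), lr=lr)
    s, a, r, s2, a2 = batch                      # static offline transitions
    for _ in range(steps):
        with torch.no_grad():
            target = r + gamma * global_q(s2, a2)    # frozen global critic as target
        td_loss = ((q(s, a) - target) ** 2).mean()
        # Proximal term: on randomly drawn (likely out-of-support) actions,
        # stay close to the aggregated critic instead of free extrapolation.
        a_ood = torch.rand_like(a) * 2.0 - 1.0       # assumes actions lie in [-1, 1]
        prox = ((q(s, a_ood) - global_q(s, a_ood).detach()) ** 2).mean()
        loss = td_loss + prox_coef * prox
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q.state_dict()


def federated_round(global_q, agent_batches):
    """Server step: average the agents' local critics (FedAvg-style)."""
    local_states = [local_update(global_q, b) for b in agent_batches]
    averaged = {
        k: torch.stack([sd[k] for sd in local_states]).mean(dim=0)
        for k in local_states[0]
    }
    global_q.load_state_dict(averaged)
    return global_q
```

A training loop would repeat federated_round for a number of communication rounds, with each agent sampling mini-batches from its own static dataset. Setting prox_coef to zero recovers a plain FedAvg retrofit of local offline policy evaluation, which is the naive approach the abstract argues overfits individual datasets during local updating.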
  • [1]
    D. Kalashnikov, A. Irpan, P. Pastor, et al., “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Proceedings of the 2nd Conference on Robot Learning, Zürich, Switzerland, pp. 651–673, 2018.
    [2]
    DeepMind, “AlphaStar: Mastering the real-time strategy game StarCraft II,” Available at: https://deepmind.google/discover/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii/, 2019-01-24.
    [3]
    L. H. Li, W. Chu, J. Langford, et al., “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, pp. 661–670, 2010.
    [4]
    P. S. Thomas, “Safe reinforcement learning,” Ph.D. thesis, University of Massachusetts, MA, USA, 2015.
    [5]
    S. Levine, A. Kumar, G. Tucker, et al., “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” arXiv preprint, arXiv: 2005.01643, 2020.
    [6]
    A. Kumar, A. Zhou, G. Tucker, et al., “Conservative Q-learning for offline reinforcement learning,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, article no. 100, 2020.
    [7]
    T. H. Yu, G. Thomas, L. T. Yu, et al., “MOPO: Model-based offline policy optimization,” in Proceedings of the 34th Conference on Neural Information Processing Systems, Online, pp. 14129–14142, 2020.
    [8]
    T. H. Yu, A. Kumar, R. Rafailov, et al., “COMBO: Conservative offline model-based policy optimization,” in Proceedings of the 35th Conference on Neural Information Processing Systems, Online, 2021.
    [9]
    D. Ghosh, A. Ajay, P. Agrawal, et al., “Offline RL policies should be trained to be adaptive,” in Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, pp. 7513–7530, 2022.
    [10]
    B. Trabucco, X. Y. Geng, A. Kumar, et al., “Design-bench: Benchmarks for data-driven offline model-based optimization,” in Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, pp. 21658–21676, 2022.
    [11]
    K. M. He, X. Y. Zhang, S. Q. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, pp. 770–778, 2016.
    [12]
    P. Rashidinejad, B. H. Zhu, C. Ma, et al., “Bridging offline reinforcement learning and imitation learning: A tale of pessimism,” in Proceedings of the 35th Conference on Neural Information Processing Systems, Online, pp. 11702–11716, 2021.
    [13]
    S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, pp. 2052–2062, 2019.
    [14]
    S. Lin, J. L. Wan, T. Y. Xu, et al., “Model-based offline meta-reinforcement learning with regularization,” in Proceedings of the 10th International Conference on Learning Representations, Online, 2022.
    [15]
    P. Kairouz, H. B. McMahan, B. Avent, et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1-2, pp. 1–210, 2021. doi: 10.1561/2200000083
    [16]
    Z. Y. Du, C. Wu, T. Yoshinaga, et al., “Federated learning for vehicular internet of things: Recent advances and open issues,” IEEE Open Journal of the Computer Society, vol. 1, pp. 45–61, 2020. doi: 10.1109/OJCS.2020.2992630
    [17]
    X. H. Xu, H. Peng, L. C. Sun, et al., “FedMood: Federated learning on mobile health data for mood detection,” arXiv preprint, arXiv: 2102.09342, 2021.
    [18]
    D. Rengarajan, N. Ragothaman, D. Kalathil, et al., “Federated ensemble-directed offline reinforcement learning,” arXiv preprint, arXiv: 2305.03097, 2023.
    [19]
    A. Kumar, J. Fu, G. Tucker, et al., “Stabilizing off-policy Q-learning via bootstrapping error reduction,” in Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, 2019.
    [20]
    Y. F. Wu, G. Tucker, and O. Nachum, “Behavior regularized offline reinforcement learning,” arXiv preprint, arXiv: 1911.11361, 2019.
    [21]
    N. Jaques, A. Ghandeharioun, J. H. Shen, et al., “Way off-policy batch deep reinforcement learning of implicit human preferences in dialog,” arXiv preprint, arXiv: 1907.00456, 2019.
    [22]
    I. Kostrikov, R. Fergus, J. Tompson, et al., “Offline reinforcement learning with fisher divergence critic regularization,” in Proceedings of the 38th International Conference on Machine Learning, Online, pp. 5774–5783, 2021.
    [23]
    R. Kidambi, A. Rajeswaran, P. Netrapalli, et al., “MOReL: Model-based offline reinforcement learning,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, article no. 1830, 2020.
    [24]
    C. Cang, A. Rajeswaran, P. Abbeel, et al., “Behavioral priors and dynamics models: Improving performance and domain transfer in offline RL,” arXiv preprint, arXiv: 2106.09119, 2021.
    [25]
    T. Matsushima, H. Furuta, Y. Matsuo, et al., “Deployment-efficient reinforcement learning via model-based offline optimization,” in Proceedings of the 9th International Conference on Learning Representations, Online, 2021.
    [26]
    T. Hishinuma and K. Senda, “Weighted model estimation for offline model-based reinforcement learning,” in Proceedings of the 35th Conference on Neural Information Processing Systems, Online, pp. 17789–17800, 2021.
    [27]
    R. Agarwal, D. Schuurmans, and M. Norouzi, “An optimistic perspective on offline reinforcement learning,” in Proceedings of the 37th International Conference on Machine Learning, Online, pp. 104–114, 2020.
    [28]
    N. Jiang and L. H. Li, “Doubly robust off-policy value evaluation for reinforcement learning,” in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, pp. 652–661, 2016.
    [29]
    R. Y. Zhang, B. Dai, L. H. Li, et al., “GenDICE: Generalized offline estimation of stationary values,” in Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
    [30]
    A. Sonabend-W, N. Laha, A. N. Ananthakrishnan, et al., “Semi-supervised off policy reinforcement learning,” arXiv preprint, arXiv: 2012.04809, 2021.
    [31]
    T. Y. Chen, K. Q. Zhang, G. B. Giannakis, et al., “Communication-efficient policy gradient methods for distributed reinforcement learning,” arXiv preprint, arXiv: 1812.03239, 2021.
    [32]
    L. Pan, L. B. Huang, T. Y. Ma, et al., “Plan better amid conservatism: Offline multi-agent reinforcement learning with actor rectification,” in Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, pp. 17221–17237, 2022.
    [33]
    C. Nadiger, A. Kumar, and S. Abdelhak, “Federated reinforcement learning for fast personalization,” in Proceedings of the 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering, Sardinia, Italy, pp. 123–127, 2019.
    [34]
    A. Anwar and A. Raychowdhury, “Multi-task federated reinforcement learning with adversaries,” arXiv preprint, arXiv: 2103.06473, 2021.
    [35]
    B. Y. Liu, L. J. Wang, and M. Liu, “Lifelong federated reinforcement learning: A learning architecture for navigation in cloud robotic systems,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4555–4562, 2019. doi: 10.1109/LRA.2019.2931179
    [36]
    H. K. Lim, J. B. Kim, J. S. Heo, et al., “Federated reinforcement learning for training control policies on multiple IoT devices,” Sensors, vol. 20, no. 5, article no. 1359, 2020. doi: 10.3390/s20051359
    [37]
    X. L. Liang, Y. Liu, T. J. Chen, et al., “Federated transfer reinforcement learning for autonomous driving,” arXiv preprint, arXiv: 1910.06001, 2019.
    [38]
    H. Cha, J. Park, H. Kim, et al., “Federated reinforcement distillation with proxy experience memory,” arXiv preprint, arXiv: 1907.06536, 2020.
    [39]
    A. A. Rusu, S. G. Colmenarejo, C. Gülçehre, et al., “Policy distillation,” in Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2016.
    [40]
    H. H. Zhuo, W. F. Feng, Y. F. Lin, et al., “Federated deep reinforcement learning,” arXiv preprint, arXiv: 1901.08277, 2020.
    [41]
    D. D. Zhou, Y. F. Zhang, A. Sonabend-W, et al., “Federated offline reinforcement learning,” arXiv preprint, arXiv: 2206.05581, 2024.
    [42]
    T. Li, A. K. Sahu, A. Talwalkar, et al., “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020. doi: 10.1109/MSP.2020.2975749
    [43]
    T. Lin, L. J. Kong, S. U. Stich, et al., “Ensemble distillation for robust model fusion in federated learning,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, article no. 198, 2020.
    [44]
    X. C. Li, J. L. Tang, S. M. Song, et al., “Avoid overfitting user specific information in federated keyword spotting,” arXiv preprint, arXiv: 2206.08864, 2022.
    [45]
    T. Li, A. K. Sahu, M. Zaheer, et al., “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems 2020, Austin, TX, USA, pp. 429–450, 2020.
    [46]
    B. McMahan, E. Moore, D. Ramage, et al., “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, pp. 1273–1282, 2017.
    [47]
    A. Reisizadeh, A. Mokhtari, H. Hassani, et al., “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, Palermo, Sicily, Italy, pp. 2021–2031, 2020.
    [48]
    J. Fu, A. Kumar, O. Nachum, et al., “D4RL: Datasets for deep data-driven reinforcement learning,” arXiv preprint, arXiv: 2004.07219, 2021.
    [49]
    E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, pp. 5026–5033, 2012.