For on-policy Actor-Critic (AC) reinforcement learning, sampling is time-consuming and expensive. To reuse previously collected samples efficiently and to reduce the large estimation variance, an off-policy AC learning algorithm based on an adaptive importance sampling (AIS) technique is proposed. The Critic estimates the value function using least-squares temporal difference learning with eligibility traces combined with the AIS technique. To control the trade-off between the bias and the variance of the policy-gradient estimate, a flattening factor is introduced into the importance weight in the AIS; its value can be determined automatically from samples and policies by an importance-weighted cross-validation method. Based on the policy gradient estimated by the Critic, the Actor updates the policy parameters so as to obtain an optimal control policy. Simulation results on a queueing problem show that AC learning based on AIS not only achieves good and stable learning performance but also converges quickly.
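As a minimal illustration of the flattening idea, assuming the flattened importance weight takes the common form (pi/b)^nu with flattening factor nu in [0, 1] (the abstract does not spell out the exact form), the sketch below evaluates a target policy pi from samples of a behavior policy b over a hypothetical two-action problem. With nu = 0 the weight is ignored (low variance but biased toward the behavior policy's value), while nu = 1 recovers full importance sampling (unbiased but potentially high variance); intermediate nu interpolates.

```python
def flattened_estimate(pi, b, rewards, nu):
    """Exact expectation under the behavior policy b of the
    flattening-weighted reward: E_b[(pi/b)^nu * r].
    pi, b: action probabilities of target/behavior policies.
    rewards: deterministic reward for each action (illustrative).
    nu: flattening factor in [0, 1]."""
    return sum(bp * (pp / bp) ** nu * r
               for pp, bp, r in zip(pi, b, rewards))

# Hypothetical two-action example.
b = [0.5, 0.5]        # behavior policy used to collect samples
pi = [0.9, 0.1]       # target policy being evaluated
rewards = [1.0, 0.0]  # reward of each action

for nu in (0.0, 0.5, 1.0):
    print(nu, round(flattened_estimate(pi, b, rewards, nu), 3))
# nu = 0.0 -> 0.5  (value under b: biased, low variance)
# nu = 1.0 -> 0.9  (value under pi: unbiased full importance sampling)
```

In the proposed method, nu is not fixed by hand as above but is selected automatically by importance-weighted cross-validation on the collected samples.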