For on-policy Actor-Critic (AC) reinforcement learning, sampling is time-consuming and expensive. To reuse previously collected samples efficiently and to reduce the large estimation variance, an off-policy AC learning algorithm based on an adaptive importance sampling (AIS) technique is proposed. The Critic estimates the value function using least-squares temporal difference learning with eligibility traces combined with the AIS technique. To control the trade-off between the bias and the variance of the policy-gradient estimate, a flattening factor is introduced into the importance weight in the AIS; its value can be determined automatically from samples and policies by an importance-weighted cross-validation method. Based on the policy gradient estimated by the Critic, the Actor updates the policy parameters so as to obtain an optimal control policy. Simulation results on a queueing problem show that AC learning based on AIS not only achieves good and stable learning performance but also converges quickly.
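As a minimal illustration of the flattening idea, assuming the flattened importance weight takes the common form (pi/b)^nu with flattening factor nu in [0, 1] (the abstract does not spell out the exact form), the sketch below evaluates a target policy pi from samples of a behavior policy b over a hypothetical two-action problem. With nu = 0 the weight is ignored (low variance but biased toward the behavior policy's value), while nu = 1 recovers full importance sampling (unbiased but potentially high variance); intermediate nu interpolates.

```python
def flattened_estimate(pi, b, rewards, nu):
    """Exact expectation under the behavior policy b of the
    flattening-weighted reward: E_b[(pi/b)^nu * r].
    pi, b: action probabilities of target/behavior policies.
    rewards: deterministic reward for each action (illustrative).
    nu: flattening factor in [0, 1]."""
    return sum(bp * (pp / bp) ** nu * r
               for pp, bp, r in zip(pi, b, rewards))

# Hypothetical two-action example.
b = [0.5, 0.5]        # behavior policy used to collect samples
pi = [0.9, 0.1]       # target policy being evaluated
rewards = [1.0, 0.0]  # reward of each action

for nu in (0.0, 0.5, 1.0):
    print(nu, round(flattened_estimate(pi, b, rewards, nu), 3))
# nu = 0.0 -> 0.5  (value under b: biased, low variance)
# nu = 1.0 -> 0.9  (value under pi: unbiased full importance sampling)
```

In the proposed method, nu is not fixed by hand as above but is selected automatically by importance-weighted cross-validation on the collected samples.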