CHEN Xingguo, GAO Yang, FAN Shunguo, “Temporal Difference Learning with Piecewise Linear Basis,” Chinese Journal of Electronics, vol. 23, no. 1, pp. 49-54, 2014,
Citation: CHEN Xingguo, GAO Yang, FAN Shunguo, “Temporal Difference Learning with Piecewise Linear Basis,” Chinese Journal of Electronics, vol. 23, no. 1, pp. 49-54, 2014,

Temporal Difference Learning with Piecewise Linear Basis

Funds:  This work is supported in part by the National Science Foundation of China (No.61035003, No.61175042, No.60721002), the National 973 Program of China (No.2009CB320702), the 973 Program of Jiangsu, China (No.BK2011005), and Program for New Century Excellent Talents in University (No.NCET-10-0476).
  • Received Date: 2012-12-01
  • Rev Recd Date: 2013-01-01
  • Publish Date: 2014-01-05
  • Temporal difference (TD) learning family tries to learn a least-squares solution of an approximate Linear value function (LVF) to deal with large scale and/or continuous reinforcement learning problems. However, due to the represented ability of the features in LVF, the predictive error of the learned LVF is bounded by the residual between the optimal value function and the projected optimal value function. In this paper, Temporal difference learning with Piecewise linear basis (PLB-TD) is proposed to further decrease the error bounds. In PLBTD, there are two steps: (1) build the piecewise linear basis for problems with different dimensions; (2) learn the parameters via some famous members from the TD learning family (linear TD, GTD, GTD2 or TDC), which complexity is O(n). The error bounds are proved to decrease to zero when the size of the piecewise basis goes into infinite. The empirical results demonstrate the effectiveness of the proposed algorithm.
  • loading
  • R. Sutton and A. Barto, Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press, 1998.
    S. Dˇzeroski, L. De Raedt et al.,"Relational reinforcement learning", Machine Learning, Vol.43, No.1, pp.7-52, 2001.
    F. Wang, N. Jin, D. Liu and Q. Wei,"Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ∈-error bound", IEEE Transactioin on Neural Network, Vol.22, No.1, pp.1-13, 2011.
    Y. Cheng, X. Wang and Y. Zhang,"A bayesian reinforcement learning algorithm based on abstract states for elevator group scheduling systems", Chinese Journal of Electronics, Vol.19, No.3, pp.394-398, 2010.
    X. Xu, D. Hu and X. Lu,"Kernel-based least squares policy iteration for reinforcement learning", IEEE Transactioin on Neural Network, Vol.18, No.7, pp.973-992, 2007.
    R. Sutton,"Learning to predict by the methods of temporal differences", Machine Learning, Vol.3, No.1, pp.9-44, 1988.
    J. Boyan,"Technical update: Least-squares temporal difference learning", Machine Learning, Vol.49, No.2, pp.233-246, 2002.
    A. Geramifard, M. Bowling and R. Sutton,"Incremental leastsquares temporal difference learning", in Proceedings of the 21st AAAI Conference on Artificial Intelligence, Boston, Massachusetts, pp.356-361, 2006.
    R. Sutton et al.,"A convergent O(n) algorithm for off-policy temporal difference learning with linear function approximation", in Advances in Neural Information Processing Systems 21, Vancouver, B.C., Canada, pp.1609-1616, 2008.
    R. Sutton, H. Maei et al.,"Fast gradient descent methods for temporal difference learning with linear function approximation", in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, pp.993-1000, 2009.
    J. Tsitsiklis and B. Van Roy,"An analysis of temporal difference learning with function approximation", IEEE Transactions on Automatic Control, Vol.42, No.5, pp.674-690, 1997.
    R. Schoknecht,"Optimality of reinforcement learning algorithms with linear function approximation", in Advances in Neural Information Processing Systems 15, Vancouver, B.C., Canada, pp.1555-1562, 2002.
    B. Scherrer,"Should one compute the temporal difference fix point or minimize the bellman residual? the unified oblique projection view", in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, pp.959-966, 2010.
    M. Ghavamzadeh, A. Lazaric et al.,"LSTD with random projections", in Advances in Neural Information Processing Systems 23, Lake Tahoe, Nevada, USA, pp.721-729, 2010.
    D. Bertsekas,"Temporal difference methods for general projected equations", IEEE Transactions on Automatic Control, Vol.56, No.9, pp.2128-2139, 2011.
    R. Parr, C. Painter-Wakefield, L. Li and M. Littman,"Analyzing feature generation for value function approximation", in Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon: ACM, pp.737-744, 2007.
    M. Loth et al.,"Sparse temporal difference learning using LASSO", in IEEE Symp. on Adaptive Dynamic Programming and Reinforcement Learning, Honolulu, pp.352-359, 2007.
    J. Kolter and A. Ng,"Regularization and feature selection in least-squares temporal difference learning", in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, pp.521-528, 2009.
    M. Hoffman et al.,"Regularized least squares temporal difference learning with nested l2 and l1 penalization", in Proceedings of European Workshop on Reinforcement Learning, Athens, Greece, pp.102-114, 2011.
    M. Ghavamzadeh et al.,"Finite sample analysis of Lasso-TD", in Proceedings of the 28th International Conference on Machine Learning, Bellevue, Washington, USA, pp.1177-1184, 2011.
    M. Geist et al.,"A dantzig selector approach to temporal difference learning", in Proceedings of the 29th International Conf. Machine Learning, Edinburgh, Scotland, pp.1399-1406, 2012.
    C. Phua and R. Fitch,"Tracking value function dynamics to improve reinforcement learning with piecewise linear function approximation", in Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, USA, pp.751758, 2007.
    L. Zhao and J. Ding,"Least squares approximation to lognormal sum distribution via piecewise linear functions", in Proceedings of the 4th IEEE Conference on Industrial Electronics and Applications, Xi'an, China, pp.1324-1329, 2009.
    D. Bertsekas, J. Tsitsiklis and A. Scientific, Neuro-Dynamic Programming, Athens, Greece: Athena Scientific Press, 1996.
    A. Dutech et al.,"Reinforcement learning benchmarks and bake-offs II", in Workshop of 17th Advances in Neural Information Processing Systems, Vancouver, B.C., Canada, 2005.
    N. Ferns, P. Castro, D. Precup and P. Panangaden,"Methods for computing state similarity in markov decision processes", in Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, Cambridge, MA, USA, pp.174-181, 2006.
    N. Ferns, P. Panangaden and D. Precup,"Bisimulation metrics for continuous markov decision processes", SIAM Journal of Computing, Vol.40, No.6, pp.1662-1714, 2011.
  • 加载中

Catalog

    通讯作者: 陈斌, bchen63@163.com
    • 1. 

      沈阳化工大学材料科学与工程学院 沈阳 110142

    1. 本站搜索
    2. 百度学术搜索
    3. 万方数据库搜索
    4. CNKI搜索

    Article Metrics

    Article views (579) PDF downloads(1654) Cited by()
    Proportional views
    Related

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return