Temporal-Difference Learning

Temporal-difference learning (TD learning) is the most central and best-known idea in reinforcement learning.
‘If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.’ — Richard S. Sutton & Andrew G. Barto
TD = DP + MC
Like DP, TD uses the value estimate of the next state to update the value estimate of the current state (bootstrapping).
Like MC, TD estimates the value function from experience, interacting with the environment and generating episode after episode (sampling, with no model of the environment required).

TD:
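The tabular TD(0) update of the state-value estimate:

$$V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]$$

The target $R_{t+1} + \gamma V(S_{t+1})$ bootstraps from the current estimate at the next state, so the update can be applied after every single step.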

MC:
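The corresponding constant-α Monte Carlo update uses the complete return $G_t$ observed after visiting $S_t$:

$$V(S_t) \leftarrow V(S_t) + \alpha \bigl[ G_t - V(S_t) \bigr]$$

so it has to wait until the end of the episode before any estimate can be updated.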

Sarsa

On-policy TD control
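Sarsa updates $Q$ toward the reward plus the discounted value of the action actually taken in the next state, so the ε-greedy policy that generates behavior is also the policy being evaluated and improved:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr]$$

A minimal tabular sketch of the control loop, assuming a hypothetical environment with Gym-style `reset()`/`step(action)` methods and discrete states; the interface and names are illustrative, not from the original post:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """On-policy TD control: bootstrap from the action actually taken next."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
            # Sarsa target: R + gamma * Q(S', A'), with A' drawn from the same policy.
            target = reward + (0.0 if done else gamma * Q[next_state, next_action])
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```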

Expected Sarsa

Off-policy TD control
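Expected Sarsa replaces the sampled next-action value with its expectation under the target policy π, which removes the variance caused by the random choice of $A_{t+1}$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Bigr]$$

Because the expectation is taken under a target policy that may differ from the behavior policy, Expected Sarsa can be run either on-policy or off-policy; with a greedy target policy it reduces to Q-learning.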

Q-learning

Off-policy TD control
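Q-learning bootstraps from the greedy (maximal) action value in the next state, regardless of which action the ε-greedy behavior policy actually takes, so the learned $Q$ directly approximates $q_*$ independently of the policy being followed:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \bigr]$$

Relative to the Sarsa sketch above, the only change in the loop is the target: $\max_a Q(S_{t+1}, a)$ replaces $Q(S_{t+1}, A_{t+1})$, and the next action no longer has to be chosen before the update is made.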

Q-learning vs Sarsa

Example: Cliff Walking (fixed ε = 0.1)

With ε held at 0.1, Sarsa learns the longer but safer path that stays away from the cliff, while Q-learning learns the optimal path right along the cliff edge. Because the ε-greedy behavior occasionally steps off the cliff, Q-learning's online per-episode return is worse than Sarsa's, even though the policy it converges to is optimal; if ε were gradually reduced, both methods would converge to the optimal policy.

The problem with Q-learning

Q-learning's target uses the same value estimates both to select the maximizing action and to evaluate it, so it tends to overestimate action values (maximization bias); Double Q-learning was introduced to address this.

Double Q-learning
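Double Q-learning keeps two independent estimates $Q_1$ and $Q_2$. On each step one of them (chosen at random) is updated, using its own argmax to select the action but the other table to evaluate it, which decouples action selection from action evaluation:

$$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma\, Q_2\bigl(S_{t+1}, \arg\max_{a} Q_1(S_{t+1}, a)\bigr) - Q_1(S_t, A_t) \Bigr]$$

with the roles of $Q_1$ and $Q_2$ swapped on the other half of the updates; behavior is typically ε-greedy with respect to $Q_1 + Q_2$.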

Q-learning vs. Double Q-learning

Training parameters: ε = 0.1, α = 0.1, γ = 1
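A minimal sketch of the Double Q-learning loop with these parameters (ε = 0.1, α = 0.1, γ = 1), assuming the same hypothetical Gym-style environment interface as in the Sarsa sketch above; this is an illustrative sketch, not the code behind the original comparison:

```python
import numpy as np

def double_q_learning(env, n_states, n_actions, episodes=300,
                      alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Two tables: one selects the argmax action, the other evaluates it."""
    rng = np.random.default_rng(seed)
    Q1 = np.zeros((n_states, n_actions))
    Q2 = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to the sum of both tables.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q1[state] + Q2[state]))
            next_state, reward, done = env.step(action)
            # Flip a coin to decide which table to update.
            if rng.random() < 0.5:
                best = int(np.argmax(Q1[next_state]))
                target = reward + (0.0 if done else gamma * Q2[next_state, best])
                Q1[state, action] += alpha * (target - Q1[state, action])
            else:
                best = int(np.argmax(Q2[next_state]))
                target = reward + (0.0 if done else gamma * Q1[next_state, best])
                Q2[state, action] += alpha * (target - Q2[state, action])
            state = next_state
    return Q1, Q2
```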
