{{NoteTA|G1=IT}}
{{机器学习导航栏}}
'''Temporal difference learning''' ('''TD learning''') is a class of model-free [[强化学习|reinforcement learning]] methods that learn by bootstrapping from the current estimate of the value function. Like the [[蒙特卡罗方法|Monte Carlo method]], these methods sample from the environment; like [[动态规划|dynamic programming]], they update the value function on the basis of current estimates.{{sfnp|Sutton|Barto|2018|p=133}} Unlike Monte Carlo methods, which can only adjust their estimates once the final outcome is known, temporal difference learning adjusts its predictions continually before the final outcome is available, making them progressively more accurate.<ref name="RSutton-1988">{{cite journal |last1=Sutton |first1=Richard S. |title=Learning to predict by the methods of temporal differences |journal=Machine Learning |date=1988-08-01 |volume=3 |issue=1 |pages=9–44 |doi=10.1007/BF00115009 |url=https://link.springer.com/article/10.1007/BF00115009 |accessdate=2023-04-04 |language=en |issn=1573-0565 |archive-date=2023-03-31 |archive-url=https://web.archive.org/web/20230331130048/https://link.springer.com/article/10.1007/BF00115009 |dead-url=no }}</ref> This bootstrapping is illustrated by the following example:

<blockquote>Suppose you need to predict the weather for Saturday and you happen to have a model for doing so. Under the usual approach, you would wait until Saturday and only then adjust your model against the actual outcome. By Friday, however, you should already have a fairly good idea of what Saturday's weather will be, and so you can adjust your model before Saturday arrives.<ref name="RSutton-1988" /></blockquote>

Temporal difference learning is also related to [[动物认知|animal cognition]] in [[动物学|zoology]].<ref name="WSchultz-1997">{{cite journal|author=Schultz, W, Dayan, P & Montague, PR.|year=1997|title=A neural substrate of prediction and reward|journal=Science|volume=275|issue=5306|pages=1593–1599|doi=10.1126/science.275.5306.1593|pmid=9054347|citeseerx=10.1.1.133.6176|s2cid=220093382 }}</ref><ref name=":0">{{Cite journal|last1=Montague|first1=P. R.|last2=Dayan|first2=P.|last3=Sejnowski|first3=T. J.|date=1996-03-01|title=A framework for mesencephalic dopamine systems based on predictive Hebbian learning|journal=The Journal of Neuroscience|volume=16|issue=5|pages=1936–1947|issn=0270-6474|pmid=8774460|pmc=6578666|doi=10.1523/JNEUROSCI.16-05-01936.1996|url=http://papers.cnl.salk.edu/PDFs/A%20Framework%20for%20Mesencephalic%20Dopamine%20Systems%20Based%20on%20Predictive%20Hebbian%20Learning%201996-2938.pdf|access-date=2023-04-04|archive-date=2018-07-21|archive-url=https://web.archive.org/web/20180721221806/http://papers.cnl.salk.edu/PDFs/A%20Framework%20for%20Mesencephalic%20Dopamine%20Systems%20Based%20on%20Predictive%20Hebbian%20Learning%201996-2938.pdf|dead-url=no}}</ref><ref name=":1">{{Cite journal|last1=Montague|first1=P.R.|last2=Dayan|first2=P.|last3=Nowlan|first3=S.J.|last4=Pouget|first4=A.|last5=Sejnowski|first5=T.J.|date=1993|title=Using aperiodic reinforcement for directed self-organization|url=http://www.gatsby.ucl.ac.uk/~dayan/papers/mdnps93.pdf|journal=Advances in Neural Information Processing Systems|volume=5|pages=969–976|access-date=2023-04-04|archive-date=2006-03-12|archive-url=https://web.archive.org/web/20060312111720/http://www.gatsby.ucl.ac.uk/~dayan/papers/mdnps93.pdf|dead-url=no}}</ref><ref name=":2">{{Cite journal|last1=Montague|first1=P. R.|last2=Sejnowski|first2=T. J.|date=1994|title=The predictive brain: temporal coincidence and temporal order in synaptic learning mechanisms|journal=Learning & Memory|volume=1|issue=1|pages=1–33|doi=10.1101/lm.1.1.1 |issn=1072-0502|pmid=10467583|s2cid=44560099 |doi-access=free}}</ref><ref name=":3">{{Cite journal|last1=Sejnowski|first1=T.J.|last2=Dayan|first2=P.|last3=Montague|first3=P.R.|date=1995|title=Predictive hebbian learning|journal=Proceedings of Eighth ACM Conference on Computational Learning Theory|pages=15–18|doi=10.1145/225298.225300|isbn=0897917235|s2cid=1709691 |doi-access=free}}</ref>

== Mathematical model ==

The tabular TD(0) method is one of the simplest temporal difference methods and a special case of stochastic approximation. It estimates the state-value function of a finite-state [[马尔可夫决策过程|Markov decision process]] (MDP) under a policy <math>\pi</math>. Let <math>V^\pi</math> denote the state-value function of the MDP with states <math>(s_t)_{t\in\mathbb{N}}</math>, rewards <math>(r_t)_{t\in\mathbb{N}}</math>, discount factor <math>\gamma</math> and policy <math>\pi</math>{{sfnp|Sutton|Barto|2018|p=134}}:

:<math>V^\pi(s) = E_{a \sim \pi}\left\{\sum_{t=0}^\infty \gamma^t r_t(a_t)\Bigg| s_0=s\right\}.</math>

For convenience we drop the action symbols from this expression; the resulting <math>V^\pi</math> satisfies the [[哈密顿-雅可比-贝尔曼方程|Hamilton–Jacobi–Bellman equation]]:

: <math>V^\pi(s)=E_{\pi}\{r_0 + \gamma V^\pi(s_1)\mid s_0=s\},</math>

so <math>r_0 + \gamma V^\pi(s_1)</math> is an unbiased estimate of <math>V^\pi(s)</math>. This observation motivates the following algorithm for estimating <math>V^\pi</math>. The algorithm begins by initialising a table <math>V(s)</math> with arbitrary values, one entry for each state of the MDP, and choosing a positive learning rate <math>\alpha</math>. The policy <math>\pi</math> is then evaluated repeatedly, and after each transition the value of the old state is updated with the obtained reward <math>r</math> as follows{{sfnp|Sutton|Barto|2018|p=135}}:

:<math> V(s) \leftarrow V(s) + \alpha(\overbrace{r + \gamma V(s')}^{\text{The TD target}} - V(s) )</math>

where <math>s</math> is the old state, <math>s'</math> is the new state, and <math>r + \gamma V(s')</math> is known as the TD target.
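To make the update rule concrete, the following is a minimal Python sketch of tabular TD(0) policy evaluation. The environment interface (<code>env.reset()</code>, <code>env.step()</code>) and the <code>policy</code> function are hypothetical placeholders for illustration, not the API of any particular library.

<syntaxhighlight lang="python">
def td0_evaluate(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation (illustrative sketch).

    Assumes a placeholder environment with env.reset() -> state and
    env.step(action) -> (next_state, reward, done), plus a policy(state)
    function returning an action.
    """
    V = {}  # state-value table; unseen states default to 0.0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD target: r + gamma * V(s'); terminal states are worth 0
            target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
            # Move V(s) a step of size alpha towards the TD target
            V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
            s = s_next
    return V
</syntaxhighlight>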
== The TD-λ algorithm ==

TD-λ is an algorithm devised by Richard S. Sutton, building on earlier temporal difference work by [[亚瑟·李·塞谬尔|Arthur Samuel]]. Its best-known application is TD-Gammon, a program developed by Gerald Tesauro that learned to play [[双陆棋|backgammon]], eventually reaching the level of expert human players.<ref>{{cite journal |last1=Tesauro |first1=Gerald |title=Temporal difference learning and TD-Gammon |journal=Communications of the ACM |date=1995-03-01 |volume=38 |issue=3 |pages=58–68 |doi=10.1145/203330.203343 |url=https://dl.acm.org/doi/10.1145/203330.203343 |accessdate=2023-04-06 |issn=0001-0782 |archive-date=2023-04-06 |archive-url=https://web.archive.org/web/20230406054821/https://dl.acm.org/doi/10.1145/203330.203343 |dead-url=no }}</ref> The parameter <math>\lambda</math> is the trace-decay parameter, with <math>0 \le \lambda \le 1</math>. The larger <math>\lambda</math> is, the more weight is given to rewards further in the future; with <math>\lambda = 1</math> the algorithm produces learning that parallels Monte Carlo reinforcement learning.{{sfnp|Sutton|Barto|2018|p=175}}
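TD-λ is commonly implemented with eligibility traces, which spread each TD error back over recently visited states. The sketch below extends the tabular TD(0) example above with accumulating traces; the environment and policy are the same hypothetical placeholders, and setting <code>lam=0</code> recovers TD(0) while <code>lam=1</code> approaches Monte Carlo-style updates.

<syntaxhighlight lang="python">
def td_lambda_evaluate(env, policy, num_episodes=1000,
                       alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) with accumulating eligibility traces (sketch)."""
    V = {}
    for _ in range(num_episodes):
        traces = {}                 # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else V.get(s_next, 0.0)
            delta = r + gamma * v_next - V.get(s, 0.0)   # TD error
            traces[s] = traces.get(s, 0.0) + 1.0         # accumulate trace
            for state in traces:
                # Each recently visited state shares in the TD error,
                # weighted by its eligibility trace, which then decays.
                V[state] = V.get(state, 0.0) + alpha * delta * traces[state]
                traces[state] = gamma * lam * traces[state]
            s = s_next
    return V
</syntaxhighlight>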
== In neuroscience ==

TD learning has also attracted attention in [[神经科学|neuroscience]]. Researchers found that the firing rate of [[多巴胺|dopamine]] [[神经元|neurons]] in the [[腹侧被盖区|ventral tegmental area]] and the [[黑质|substantia nigra]] resembles the error function of TD learning,<ref name="WSchultz-1997"/><ref name=":0" /><ref name=":1" /><ref name=":2" /><ref name=":3" /> which returns the difference between the estimated [[犒赏系统|reward]] and the reward actually received at any given state or time step. The larger the error function, the larger the difference between the expected and the actual reward.

The behaviour of dopamine cells also parallels TD learning. In one experiment, a monkey was trained to associate a stimulus with a juice reward while the responses of its dopamine cells were measured.<ref name="WSchultz-1998">{{cite journal |author=Schultz, W. |year=1998 |title=Predictive reward signal of dopamine neurons |journal=Journal of Neurophysiology |volume=80 |issue=1 |pages=1–27|doi=10.1152/jn.1998.80.1.1 |pmid=9658025 |citeseerx=10.1.1.408.5994 |s2cid=52857162 }}</ref> At first the firing rate of the dopamine cells increased when the monkey received the juice, indicating a difference between the expected and the actual reward. With repeated training the expected reward changed, and the firing rate no longer increased significantly on delivery of the juice; when an expected reward was withheld, the firing rate decreased. These features closely resemble the behaviour of the [[误差函数|error function]] in TD learning.

Much current research on neural function builds on TD learning,<ref name="PDayan-2001">{{cite journal |author=Dayan, P. |year=2001 |title=Motivated reinforcement learning |journal=Advances in Neural Information Processing Systems |volume=14 |pages=11–18 |publisher=MIT Press |url=http://books.nips.cc/papers/files/nips14/CS01.pdf |access-date=2023-04-11 |archive-date=2012-05-25 |archive-url=https://web.archive.org/web/20120525163926/http://books.nips.cc/papers/files/nips14/CS01.pdf |dead-url=yes }}</ref><ref>{{Cite journal |last=Tobia, M. J., etc. |date=2016 |title=Altered behavioral and neural responsiveness to counterfactual gains in the elderly |journal=Cognitive, Affective, & Behavioral Neuroscience |volume=16 |issue=3 |pages=457–472|doi=10.3758/s13415-016-0406-7 |pmid=26864879 |s2cid=11299945 |doi-access=free }}</ref> and the approach has been applied to the study of [[精神分裂症|schizophrenia]] and of the pharmacological effects of dopamine.<ref name="ASmith-2006">{{cite journal |author=Smith, A., Li, M., Becker, S. and Kapur, S. |year=2006 |title=Dopamine, prediction error, and associative learning: a model-based account |journal=Network: Computation in Neural Systems |volume=17 |issue=1 |pages=61–84 |doi=10.1080/09548980500361624 |pmid=16613795|s2cid=991839 }}</ref>

== References ==
{{reflist|30em}}

=== Cited works ===
* {{cite book |title=Reinforcement Learning: An Introduction |first1=Richard S. |last1=Sutton |first2=Andrew G. |last2=Barto |edition=2nd |publisher=MIT Press |place=Cambridge, MA |year=2018 |url=http://www.incompleteideas.net/book/the-book.html |ref={{sfnRef|Sutton|Barto|2018}} |access-date=2023-04-04 |archive-date=2023-04-26 |archive-url=https://web.archive.org/web/20230426022549/http://incompleteideas.net/book/the-book.html |dead-url=no }}

== Further reading ==
* {{cite book |first=S. P. |last=Meyn |year=2007 |title=Control Techniques for Complex Networks |publisher=Cambridge University Press |isbn=978-0521884419 |ref=none}} See final chapter and appendix.
* {{cite journal |last1=Sutton |first1=R. S. |last2=Barto |first2=A. G. |year=1990 |title=Time Derivative Models of Pavlovian Reinforcement |journal=Learning and Computational Neuroscience: Foundations of Adaptive Networks |pages=497–537 |url=http://incompleteideas.net/sutton/papers/sutton-barto-90.pdf |ref=none |access-date=2023-04-06 |archive-date=2017-03-30 |archive-url=https://web.archive.org/web/20170330003906/http://incompleteideas.net/sutton/papers/sutton-barto-90.pdf |dead-url=no }}

== External links ==
* [http://pitoko.net/tdgravity Connect Four TDGravity Applet] {{Wayback|url=http://pitoko.net/tdgravity |date=20120724150820 }} (+ mobile phone version) – self-learned using TD-Leaf method (combination of TD-Lambda with shallow tree search)
* [http://chet-weger.herokuapp.com/learn_meta_ttt/ Self Learning Meta-Tic-Tac-Toe] {{Wayback|url=http://chet-weger.herokuapp.com/learn_meta_ttt/ |date=20140319223951 }} Example web app showing how temporal difference learning can be used to learn state evaluation constants for a minimax AI playing a simple board game.
* [https://web.archive.org/web/20131116084228/http://www.cs.colorado.edu/~grudic/teaching/CSCI4202/RL.pdf Reinforcement Learning Problem], document explaining how temporal difference learning can be used to speed up Q-learning
* [https://www.cal-r.org/index.php?id=TD-sim TD-Simulator] {{Wayback|url=https://www.cal-r.org/index.php?id=TD-sim |date=20230404150028 }} Temporal difference simulator for classical conditioning

[[Category:计算神经科学]]
[[Category:減法]]
[[Category:强化学习]]