
{{机器学习导航栏|强化学习}}

'''SARSA''' is a [[强化学习|reinforcement learning]] algorithm in the field of [[机器学习|machine learning]]. Its name is formed from the initials of '''S'''tate–'''A'''ction–'''R'''eward–'''S'''tate–'''A'''ction.

SARSA was first proposed by G. A. Rummery and M. Niranjan in 1994, under the name "Modified [[联结主义|Connectionist]] Q-Learning".<ref>{{Cite web |url=http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.2539&rep=rep1&type=pdf |title=Online Q-Learning using Connectionist Systems" by Rummery & Niranjan (1994) |access-date=2022-07-14 |archive-date=2013-06-08 |archive-url=https://web.archive.org/web/20130608043102/http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.2539&rep=rep1&type=pdf |dead-url=no }}</ref> {{link-en|Richard S. Sutton}} suggested the alternative name SARSA.<ref>{{cite web
|first=Nivash
|last=Jeevanandam
|date=2021-09-13
|title=Underrated But Fascinating ML Concepts #5 – CST, PBWM, SARSA, & Sammon Mapping
|url=https://analyticsindiamag.com/underrated-but-fascinating-ml-concepts-5-cst-pbwm-sarsa-sammon-mapping/
|access-date=2021-12-05
|website=Analytics India Magazine
|language=en
|archive-date=2021-12-05
|archive-url=https://web.archive.org/web/20211205175229/https://analyticsindiamag.com/underrated-but-fascinating-ml-concepts-5-cst-pbwm-sarsa-sammon-mapping/
|dead-url=no
}}</ref>

SARSA and [[Q学习|Q-learning]] differ mainly in how they update the Q-value, the estimate of expected reward. SARSA performs the update with the [[多元组|5-tuple]] (s<sub>t</sub>, a<sub>t</sub>, r<sub>t</sub>, s<sub>t+1</sub>, a<sub>t+1</sub>), where s, a, and r are the state, action, and reward of a [[马可夫决策过程|Markov decision process]] (MDP), and t and t+1 denote the current step and the next step.<ref>{{cite web |author1=Richard S. Sutton and Andrew G. Barto |title=Sarsa: On-Policy TD Control |url=http://incompleteideas.net/book/ebook/node64.html |website=Reinforcement Learning:  An Introduction |access-date=2022-07-14 |archive-date=2020-07-05 |archive-url=https://web.archive.org/web/20200705201035/http://www.incompleteideas.net/book/ebook/node64.html |dead-url=no }}</ref>

==Algorithm==

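 Initialize <math>s_{t}</math>; choose <math>a_{t}</math> from <math>s_{t}</math> with a policy derived from <math>Q</math> (e.g. ε-[[贪心算法|greedy]])
 '''repeat''' for each ''step'' of the ''episode''
  Take action <math>a_{t}</math>; observe the reward <math>r_{t}</math> and the next state <math>s_{t+1}</math>
  Choose <math>a_{t+1}</math> from <math>s_{t+1}</math> using the current <math>Q</math> and a given policy (e.g. ε-[[贪心算法|greedy]])
  <math>Q^{new}(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \, [r_{t} + \gamma \, Q(s_{t+1}, a_{t+1})-Q(s_t,a_t)]</math>
  <math>s_{t} \leftarrow s_{t+1}</math>; <math>a_{t} \leftarrow a_{t+1}</math>
 '''until''' state <math>s_{t}</math> is terminal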
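
The loop above can be sketched in Python roughly as follows. This is a minimal tabular sketch, not a reference implementation: the five-state corridor environment, the function names <code>step</code>, <code>choose_action</code>, and <code>sarsa</code>, and the default hyperparameter values are all assumptions made for illustration.

```python
import random

N_STATES = 5       # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (0, 1)   # 0 = move left, 1 = move right

def step(state, action):
    """Toy environment (an assumption): returns (reward, next_state, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return (1.0 if done else 0.0), next_state, done

def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])

def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        a = choose_action(Q, s, epsilon, rng)        # initial action
        done = False
        while not done:
            r, s_next, done = step(s, a)             # act, observe r and s'
            a_next = choose_action(Q, s_next, epsilon, rng)
            # The terminal state's Q-values are never updated and stay 0,
            # so the target reduces to r on the final step.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q
```

Calling <code>sarsa()</code> returns the learned table; in this toy corridor the value of moving right from state 3 approaches 1, since that transition always ends the episode with reward 1.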

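When the next action <math>a_{t+1}</math> is chosen with an ε-[[贪心算法|greedy]] policy:
* with probability ε, the next action is chosen at random
* with probability 1-ε, the next action is the one that maximizes <math>Q(s_{t+1}, a)</math> over all actions <math>a</math>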
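
The two cases above can be written as a small helper. This is a sketch under assumptions: the function name <code>epsilon_greedy</code>, the representation of Q-values as a plain list indexed by action, and the random tie-breaking among equally good actions are illustrative choices, not part of the algorithm's definition.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index given the Q-values of a single state."""
    if rng.random() < epsilon:
        # Explore: uniform random action.
        return rng.randrange(len(q_values))
    # Exploit: an action with the highest Q-value (ties broken at random).
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])
```

With <code>epsilon=0</code> the helper is purely greedy, and with <code>epsilon=1</code> it is purely random.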

In this algorithm, the [[超参数 (机器学习)|hyperparameter]] <math>\alpha</math> is the [[学习率|learning rate]] and <math>\gamma</math> is the discount factor.
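
As a worked instance of the update rule, assume the purely hypothetical values <math>\alpha = 0.5</math>, <math>\gamma = 0.9</math>, <math>Q(s_t,a_t) = 2</math>, <math>r_t = 1</math>, and <math>Q(s_{t+1},a_{t+1}) = 4</math>. A single update then gives:

:<math>Q^{new}(s_t,a_t) = 2 + 0.5 \, [1 + 0.9 \times 4 - 2] = 2 + 0.5 \times 2.6 = 3.3</math>

A larger <math>\alpha</math> moves the estimate further toward the target <math>r_t + \gamma \, Q(s_{t+1},a_{t+1}) = 4.6</math>, while a smaller <math>\gamma</math> lowers the target by giving the next step's value less weight.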

When updating <math>Q</math>, [[Q学习|Q-learning]] uses <math>\max_a Q(s_{t+1}, a)</math> as its estimate of the next state's value, whereas SARSA uses <math>Q(s_{t+1}, a_{t+1})</math>.<ref>{{cite web |author1=TINGWU WANG |title=Tutorial of Reinforcement: A Special Focus on Q-Learning |url=https://www.cs.toronto.edu/~jlucas/teaching/csc411/lectures/tut11_handout.pdf |website=cs.toronto |access-date=2022-07-14 |archive-date=2022-07-14 |archive-url=https://web.archive.org/web/20220714012033/https://www.cs.toronto.edu/~jlucas/teaching/csc411/lectures/tut11_handout.pdf |dead-url=no }}</ref> Some optimizations proposed for Q-learning can also be applied to SARSA.<ref>{{Cite journal |last=Wiering |first=Marco |last2=Schmidhuber |first2=Jürgen |title=Fast Online Q(λ) |url=https://link.springer.com/article/10.1023/A:1007562800292 |journal=Machine Learning |language=en |date=1998-10-01 |volume=33 |issue=1 |page=105–115 |doi=10.1023/A:1007562800292 |issn=1573-0565 |s2cid=8358530}}</ref>
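
The difference between the two estimates can be made concrete with a short sketch. The function names <code>sarsa_target</code> and <code>q_learning_target</code> and the numbers used are assumptions chosen for illustration.

```python
def sarsa_target(r, gamma, q_next, a_next):
    """SARSA: bootstrap from the action actually selected in the next state."""
    return r + gamma * q_next[a_next]

def q_learning_target(r, gamma, q_next):
    """Q-learning: bootstrap from the best action in the next state."""
    return r + gamma * max(q_next)

# Hypothetical Q(s_{t+1}, .) for two actions; the policy happened to
# explore and picked action 0 as a_{t+1}.
q_next = [0.2, 0.8]
sarsa_value = sarsa_target(0.0, 0.5, q_next, a_next=0)
q_learning_value = q_learning_target(0.0, 0.5, q_next)
```

The two targets coincide whenever <code>a_next</code> is a greedy action and differ only when the policy explores, as in this example, where the SARSA target is the smaller of the two.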

==See also==
* [[强化学习|Reinforcement learning]]
* [[Q学习|Q-learning]]
* [[馬可夫決策過程|Markov decision process]]

==References==
{{Reflist}}

{{Differentiable_computing}}
[[Category:機器學習演算法]]