{{expand|time=2018-06-25T11:52:28+00:00}}
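'''Deep reinforcement learning''' (Deep RL or DRL) is a subfield of [[機器學習|machine learning]] that combines [[強化學習|reinforcement learning]] and [[深度學習|deep learning]]. Reinforcement learning studies how an [[智能代理|intelligent agent]] can learn to make better decisions through trial and error. Deep reinforcement learning applies deep learning methods so that the agent can make decisions directly from unstructured data, without a hand-designed [[狀態空間|state space]]. Deep reinforcement learning algorithms can take very large inputs (such as every pixel of a video game frame) and determine which action best achieves an objective (such as the highest game score). Deep reinforcement learning has been applied widely, including in [[機器人學|robotics]], [[電動遊戲|video games]], [[自然語言處理|natural language processing]], [[電腦視覺|computer vision]], education, transportation, finance, and [[醫療衛生|healthcare]].<ref name="francoislavet2018"/>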

==Overview==

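===Deep learning===
[[深度學習|Deep learning]] is a form of [[機器學習|machine learning]] that trains [[人工神經網路|artificial neural networks]] to map a set of inputs to a specific set of outputs. It is often carried out as [[監督式學習|supervised learning]] on labeled datasets. Deep learning methods can operate directly on high-dimensional, complex raw inputs and, compared with earlier methods, depend less on manual [[特徵工程|feature engineering]] to extract features from the input. As a result, deep learning has brought breakthroughs in fields such as [[電腦視覺|computer vision]] and [[自然語言處理|natural language processing]].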

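===Reinforcement learning===
In reinforcement learning, an intelligent agent interacts with an environment and learns through trial and error to make better decisions. Such problems are often formalized mathematically as a [[馬可夫決策過程|Markov decision process]]: at each time step the agent is in a state <math>s</math> of the environment; after taking an action <math>a</math>, it receives a reward <math>r</math> and moves to the next state <math>s'</math> according to the environment's state transition function <math>p(s'|s, a)</math>. The agent's goal is to learn a policy <math>\pi(a|s)</math> (that is, a mapping from the current state to the action to take) that maximizes the total reward it receives. Unlike in [[最佳控制|optimal control]], a reinforcement learning algorithm can probe the state transition function <math>p(s'|s, a)</math> only by sampling.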
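
The interaction loop of a Markov decision process can be illustrated with a minimal Python sketch. The two-state environment, its probabilities and rewards, and the names <code>TRANSITIONS</code>, <code>step</code>, <code>run_episode</code> and <code>always_one</code> are all invented here for illustration. The agent only ever calls <code>step</code>, so it sees samples from <math>p(s'|s, a)</math> and never the transition table itself.

```python
import random

# Hypothetical two-state, two-action MDP, invented for illustration.
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}


def step(state, action, rng):
    """Sample (next_state, reward) from p(s'|s, a)."""
    outcomes = TRANSITIONS[state][action]
    u = rng.random()
    cumulative = 0.0
    for prob, next_state, reward in outcomes:
        cumulative += prob
        if u < cumulative:
            return next_state, reward
    # Fall back to the last outcome to guard against round-off.
    _, next_state, reward = outcomes[-1]
    return next_state, reward


def run_episode(policy, horizon, rng, start_state=0):
    """Follow the policy for `horizon` steps and return the total reward."""
    state, total_reward = start_state, 0.0
    for _ in range(horizon):
        action = policy(state, rng)
        state, reward = step(state, action, rng)
        total_reward += reward
    return total_reward


def always_one(state, rng):
    """A fixed policy that picks action 1 in every state."""
    return 1


if __name__ == "__main__":
    print(run_episode(always_one, horizon=10, rng=random.Random(0)))
```

A learning algorithm would adjust the policy between episodes so that the total returned by <code>run_episode</code> tends to increase. Here the policy is fixed, so the sketch only shows the sampling interface.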

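===Deep reinforcement learning===
In many real-world decision problems, the state <math>s</math> of the [[馬可夫決策過程|Markov decision process]] is high-dimensional (for example, images taken by a camera or sensor streams from a robot), which limits the feasibility of traditional reinforcement learning methods. Deep reinforcement learning uses deep learning techniques to solve decision problems in reinforcement learning: it trains artificial neural networks to represent the policy <math>\pi(a|s)</math> and develops algorithms specialized for this training setting.<ref name="DQN2"/>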
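
As a rough sketch of what it means for a neural network to represent <math>\pi(a|s)</math>, the following pure-Python example maps a state vector to a probability distribution over discrete actions. The two-layer architecture, the layer sizes, the initialization and every function name are assumptions made for illustration. Practical systems use a deep learning library and much larger networks, and no training is shown here.

```python
import math
import random


def init_policy(state_dim, hidden_dim, num_actions, rng):
    """Create small random weights for a two-layer policy network."""
    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "W1": matrix(hidden_dim, state_dim), "b1": [0.0] * hidden_dim,
        "W2": matrix(num_actions, hidden_dim), "b2": [0.0] * num_actions,
    }


def affine(weights, bias, x):
    """Compute weights @ x + bias for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probabilities(params, state):
    """Return pi(a|s) for every discrete action a."""
    hidden = [math.tanh(h) for h in affine(params["W1"], params["b1"], state)]
    return softmax(affine(params["W2"], params["b2"], hidden))


def sample_action(params, state, rng):
    """Draw an action index according to pi(a|s)."""
    probs = action_probabilities(params, state)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    params = init_policy(state_dim=4, hidden_dim=8, num_actions=3, rng=rng)
    print(action_probabilities(params, [0.1, -0.2, 0.3, 0.0]))
```

The softmax output layer is one common choice for discrete actions. Continuous action spaces typically need a different output parameterization, which this sketch does not cover.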

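==Algorithms==

Many deep reinforcement learning algorithms now exist for training decision-making models, each with its own strengths and weaknesses. Broadly, they fall into two classes according to whether they build a model of the environment dynamics: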

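* '''Model-based''' deep reinforcement learning algorithms: build neural network models that predict the environment's reward function <math>r(s, a)</math> and state transition function <math>p(s'|s, a)</math>; these models can be trained with [[監督式學習|supervised learning]]. Once the environment model is trained, a policy <math>\pi(a|s)</math> can be obtained with [[模型預測控制|model predictive control]]. However, because the environment model may not predict the real environment perfectly, the agent often needs to replan its actions while interacting with the environment. Alternatively, actions can be planned against the trained environment model using [[蒙地卡羅樹搜尋|Monte Carlo tree search]] or the {{link-en|交叉熵方法|Cross-entropy method|cross-entropy method}}.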

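* '''Model-free''' deep reinforcement learning algorithms: train a neural network directly to represent the policy <math>\pi(a|s)</math>. Here "model-free" means that no model of the environment is built, not that no machine learning model is used at all. Such a policy model can be trained directly with the policy gradient<ref name="williams1992"/>, but policy gradient estimates have high variance, which makes efficient training difficult. More advanced training methods attempt to address this stability problem: Trust Region Policy Optimization (TRPO)<ref name="schulman2015trpo"/> and Proximal Policy Optimization (PPO)<ref name="schulman2017ppo"/>. Another family of model-free deep reinforcement learning algorithms trains a neural network to predict the sum of future rewards, <math>V^{\pi}(s)</math> or <math>Q^{\pi}(s, a)</math><ref name="DQN1"/>; this family includes [[時序差分學習|temporal-difference learning]], [[Q學習|deep Q-learning]], and [[SARSA算法|SARSA]]. If the action space is discrete, the policy <math>\pi(a|s)</math> can be obtained by enumerating all actions and choosing the one that maximizes the <math>Q</math> function. If the action space is continuous, the <math>Q</math> function cannot be turned into a policy <math>\pi(a|s)</math> directly, so a policy model has to be trained at the same time<ref name="lillicrap2015ddpg"/><ref name="mnih2016a3c"/><ref name="haarnoja2018sac"/>, which makes the method an "actor-critic" algorithm.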
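
The value-based, model-free approach for a discrete action space can be sketched with tabular Q-learning, used here as a simplified stand-in for deep Q-learning: a lookup table replaces the neural network. The four-state chain environment, the hyperparameters and all names are hypothetical choices for illustration. The learner never reads the transition rule and only uses sampled transitions from <code>step</code>, and the greedy policy enumerates the actions to find the maximum of <math>Q</math>.

```python
import random

NUM_STATES, NUM_ACTIONS = 4, 2  # hypothetical chain: action 1 = right, 0 = left


def step(state, action):
    """Toy dynamics: reward 1 only when the last state is reached."""
    if action == 1:
        next_state = min(state + 1, NUM_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == NUM_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q, state):
    """Enumerate all actions and return one that maximizes Q(state, a)."""
    return max(range(NUM_ACTIONS), key=lambda a: q[state][a])


def epsilon_greedy(q, state, epsilon, rng):
    """Explore with probability epsilon; otherwise act greedily, random ties."""
    if rng.random() < epsilon:
        return rng.randrange(NUM_ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in range(NUM_ACTIONS) if q[state][a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Learn a Q table from sampled transitions only."""
    rng = random.Random(seed)
    q = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap the episode length
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Temporal-difference target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


if __name__ == "__main__":
    q = q_learning()
    print([greedy_action(q, s) for s in range(NUM_STATES - 1)])
```

In deep Q-learning the table <code>q</code> would be replaced by a neural network that takes the state as input, and the update would become a gradient step on the squared difference between the prediction and the target. Those parts are omitted here.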

== Applications ==

=== Games ===

* [[围棋|Go]]: [[AlphaGo]]
* [[國際象棋|Chess]]

=== Robotics ===

* Robot planning

=== Smart cities ===

* Indoor positioning<ref name="mohammadi2018semi"/>
* Intelligent transportation

== See also ==

* [[强化学习|Reinforcement learning]]
* [[Q学习|Q-learning]]
* [[SARSA算法|SARSA]]
* [[深度学习|Deep learning]]

== References ==

{{Reflist|refs=
<ref name="DQN1">{{cite conference |first= Volodymyr |display-authors= etal |last= Mnih |date= December 2013 |title= Playing Atari with Deep Reinforcement Learning |url= https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf |conference= NIPS Deep Learning Workshop 2013 |access-date= 2021-12-15 |archive-date= 2014-09-12 |archive-url= https://web.archive.org/web/20140912094917/https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf |dead-url= no }}</ref>
<ref name="DQN2">{{cite journal |first= Volodymyr|display-authors=etal|last= Mnih |year=2015 |title= Human-level control through deep reinforcement learning |journal=Nature|volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236|pmid=25719670|bibcode=2015Natur.518..529M |s2cid=205242740}}</ref>
<ref name="francoislavet2018">{{cite journal|last1=Francois-Lavet|first1=Vincent|last2=Henderson|first2=Peter|last3=Islam|first3=Riashat|last4=Bellemare|first4=Marc G.|last5=Pineau|first5=Joelle|date=2018|title=An Introduction to Deep Reinforcement Learning|journal=Foundations and Trends in Machine Learning|volume=11|issue=3–4|pages=219–354|arxiv=1811.12560|bibcode=2018arXiv181112560F|doi=10.1561/2200000071|issn=1935-8237|s2cid=54434537}}</ref>
<ref name="mohammadi2018semi">{{cite journal|title=Semisupervised Deep Reinforcement Learning in Support of IoT and Smart City Services|url=https://ieeexplore.ieee.org/document/7945258/|first1=Mehdi|last2=Al-Fuqaha|first2=Ala|journal=IEEE Internet of Things Journal|issue=2|doi=10.1109/JIOT.2017.2712560|year=2018|volume=5|pages=624–635|last3=Guizani|first3=Mohsen|last4=Oh|first4=Jun-Seok|last1=Mohammadi|access-date=2018-06-25|archive-date=2019-06-01|archive-url=https://web.archive.org/web/20190601224641/https://ieeexplore.ieee.org/document/7945258/|dead-url=no}}</ref>
<ref name="schulman2015trpo">{{Cite conference|title=Trust Region Policy Optimization|last1=Schulman|first1=John|last2=Levine|first2=Sergey|last3=Moritz|first3=Philipp|last4=Jordan|first4=Michael|last5=Abbeel|first5=Pieter|date=2015|arxiv=1502.05477|conference=International Conference on Machine Learning (ICML)|url=https://arxiv.org/abs/1502.05477|access-date=2021-12-15|archive-date=2022-01-02|archive-url=https://web.archive.org/web/20220102024518/https://arxiv.org/abs/1502.05477|dead-url=no}}</ref>
<ref name="schulman2017ppo">{{Cite conference|title=Proximal Policy Optimization Algorithms|last1=Schulman|first1=John|last2=Wolski|first2=Filip|last3=Dhariwal|first3=Prafulla|last4=Radford|first4=Alec|last5=Klimov|first5=Oleg|date=2017|arxiv=1707.06347|url=https://arxiv.org/abs/1707.06347|access-date=2021-12-15|archive-date=2022-01-02|archive-url=https://web.archive.org/web/20220102024432/https://arxiv.org/abs/1707.06347|dead-url=no}}</ref>
<ref name="williams1992">{{Cite journal|last1=Williams|first1=Ronald J|journal=Machine Learning|pages=229–256|title = Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning|date=1992|volume=8|issue=3–4|doi=10.1007/BF00992696|s2cid=2332513|doi-access=free}}</ref>
<ref name="lillicrap2015ddpg">{{Cite conference|title=Continuous control with deep reinforcement learning|last1=Lillicrap|first1=Timothy|last2=Hunt|first2=Jonathan|last3=Pritzel|first3=Alexander|last4=Heess|first4=Nicolas|last5=Erez|first5=Tom|last6=Tassa|first6=Yuval|last7=Silver|first7=David|last8=Wierstra|first8=Daan|conference=International Conference on Learning Representations (ICLR)|date=2016|arxiv=1509.02971|url=https://arxiv.org/abs/1509.02971|access-date=2021-12-15|archive-date=2022-01-02|archive-url=https://web.archive.org/web/20220102024544/https://arxiv.org/abs/1509.02971|dead-url=no}}</ref>
<ref name="mnih2016a3c">{{Cite conference|title=Asynchronous Methods for Deep Reinforcement Learning|last1=Mnih|first1=Volodymyr|last2=Puigdomenech Badia|first2=Adria|last3=Mirza|first3=Mehdi|last4=Graves|first4=Alex|last5=Harley|first5=Tim|last6=Lillicrap|first6=Timothy|last7=Silver|first7=David|last8=Kavukcuoglu|first8=Koray|conference=International Conference on Machine Learning (ICML)|date=2016|arxiv=1602.01783|url=https://arxiv.org/abs/1602.01783|access-date=2021-12-15|archive-date=2022-01-08|archive-url=https://web.archive.org/web/20220108120027/https://arxiv.org/abs/1602.01783|dead-url=no}}</ref>
<ref name="haarnoja2018sac">{{Cite conference|title=Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor|last1=Haarnoja|first1=Tuomas|last2=Zhou|first2=Aurick|last3=Abbeel|first3=Pieter|last4=Levine|first4=Sergey|conference=International Conference on Machine Learning (ICML)|date=2018|arxiv=1801.01290|url=https://arxiv.org/abs/1801.01290|access-date=2021-12-15|archive-date=2022-01-02|archive-url=https://web.archive.org/web/20220102194101/https://arxiv.org/abs/1801.01290|dead-url=no}}</ref>
}}
[[Category:机器学习]]
[[Category:强化学习]]