In [[机器学习|machine learning]], '''hyperparameter optimization'''<ref>Matthias Feurer and Frank Hutter. [https://link.springer.com/content/pdf/10.1007%2F978-3-030-05318-5_1.pdf Hyperparameter optimization] {{Wayback|url=https://link.springer.com/content/pdf/10.1007%2F978-3-030-05318-5_1.pdf |date=20221218124456 }}. In: ''AutoML: Methods, Systems, Challenges'', pages 3–38.</ref> or '''tuning''' is the problem of choosing a set of optimal [[超参数 (机器学习)|hyperparameters]] for a learning algorithm. A hyperparameter is a [[参数|parameter]] whose value is used to control the learning process. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model, one which minimizes a predefined [[损失函数|loss function]] on given independent data.<ref name=abs1502.02127>{{cite arXiv |eprint=1502.02127|last1=Claesen|first1=Marc|title=Hyperparameter Search in Machine Learning|author2=Bart De Moor|class=cs.LG|year=2015}}</ref> The objective function takes a tuple of hyperparameters and returns the associated loss.<ref name=abs1502.02127/> [[交叉验证|Cross-validation]] is often used to estimate this generalization performance, and therefore to choose the set of hyperparameter values that maximizes it.<ref name="bergstra">{{cite journal|last1=Bergstra|first1=James|last2=Bengio|first2=Yoshua|year=2012|title=Random Search for Hyper-Parameter Optimization|url=http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf|journal=Journal of Machine Learning Research|volume=13|pages=281–305|access-date=2024-06-08|archive-date=2023-11-18|archive-url=https://web.archive.org/web/20231118054005/https://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf|dead-url=no}}</ref>

== Methods ==
[[File:Hyperparameter Optimization using Grid Search.svg|thumb|Grid search over two hyperparameters. Ten different values are considered for each hyperparameter, so 100 different combinations are evaluated and compared. Blue contours indicate regions with strong results, red ones regions with poor results.]]

=== Grid search ===
The traditional approach to hyperparameter optimization is grid search, or parameter sweep, which is an [[暴力搜索|exhaustive search]] through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by [[交叉验证|cross-validation]] on the training set<ref>Chin-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin (2010). [http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf A practical guide to support vector classification] {{Wayback|url=http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf |date=20130625201224 }}. Technical Report, [[National Taiwan University]].</ref> or evaluation on a held-out validation set.<ref>{{cite journal | vauthors = Chicco D | title = Ten quick tips for machine learning in computational biology | journal = BioData Mining | volume = 10 | issue = 35 | pages = 35 | date = December 2017 | pmid = 29234465 | doi = 10.1186/s13040-017-0155-3 | pmc= 5721660 | doi-access = free }}</ref>

Since the parameter space of a machine learning algorithm may include real-valued or unbounded spaces for some parameters, manually chosen bounds and discretization may be necessary before applying grid search.

For example, a typical soft-margin [[支持向量机|SVM]] [[统计分类|classifier]] equipped with an [[径向基函数核|RBF kernel]] has at least two hyperparameters that need to be tuned for good performance on unseen data: a regularization constant ''C'' and a kernel hyperparameter γ. Both parameters are continuous, so to perform grid search one selects a finite set of "reasonable" values for each, say
:<math>C \in \{10, 100, 1000\}</math>
:<math>\gamma \in \{0.1, 0.2, 0.5, 1.0\}</math>
Grid search then trains an SVM with each pair <math>(C,\ \gamma)</math> in the [[笛卡尔积|Cartesian product]] of these two sets and evaluates its performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair). Finally, the grid search algorithm outputs the hyperparameters that achieved the highest score in this validation procedure.

Grid search suffers from the [[维数灾难|curse of dimensionality]], but is often [[过易并行|embarrassingly parallel]] because the hyperparameter settings it evaluates are typically independent of each other.<ref name="bergstra"/>

[[File:Hyperparameter Optimization using Random Search.svg|thumb|Random search over different combinations of two hyperparameters. In this example 100 random choices are evaluated. The green bars show that more individual values of each hyperparameter are considered here than in a grid search.]]

=== Random search ===
Random search replaces the exhaustive enumeration of all combinations with random selection. This applies straightforwardly to the discrete setting described above, but it also generalizes to continuous and mixed spaces. For continuous hyperparameters, random search can explore many more distinct values than a grid search,<ref name="bergstra" /> which is especially useful when the optimization problem has a low intrinsic dimensionality.<ref>{{Cite journal|last1=Ziyu|first1=Wang|last2=Frank|first2=Hutter|last3=Masrour|first3=Zoghi|last4=David|first4=Matheson|last5=Nando|first5=de Feitas|date=2016|title=Bayesian Optimization in a Billion Dimensions via Random Embeddings|journal=Journal of Artificial Intelligence Research|language=en|volume=55|pages=361–387|doi=10.1613/jair.4806|arxiv=1301.1942|s2cid=279236}}</ref> Random search is also [[过易并行|embarrassingly parallel]], and additionally allows prior knowledge to be incorporated by specifying the distribution from which to sample. Despite its simplicity, random search remains an important baseline against which newer hyperparameter optimization methods are compared.
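The two searches described above can be written in a few lines. The sketch below is only illustrative: it assumes scikit-learn's <code>GridSearchCV</code> and <code>RandomizedSearchCV</code>, SciPy's <code>loguniform</code> distribution, and a toy dataset, none of which are prescribed by this article. It scores every pair in the Cartesian product of the ''C'' and γ sets by cross-validation, then runs a random search that samples both hyperparameters from assumed log-uniform ranges.

<syntaxhighlight lang="python">
# Minimal grid search and random search for the SVM example above.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import loguniform

X, y = load_iris(return_X_y=True)   # toy dataset, illustrative only

# Grid search: evaluate every (C, gamma) pair in the Cartesian product,
# scoring each pair by cross-validation on the training data.
grid = {"C": [10, 100, 1000], "gamma": [0.1, 0.2, 0.5, 1.0]}
grid_search = GridSearchCV(SVC(kernel="rbf"), param_grid=grid, cv=5)
grid_search.fit(X, y)
print("grid search best:", grid_search.best_params_, grid_search.best_score_)

# Random search: sample (C, gamma) from user-specified distributions instead
# of enumerating a fixed grid; the log-uniform ranges here are assumptions.
dists = {"C": loguniform(1e0, 1e3), "gamma": loguniform(1e-2, 1e0)}
random_search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions=dists,
                                   n_iter=20, cv=5, random_state=0)
random_search.fit(X, y)
print("random search best:", random_search.best_params_, random_search.best_score_)
</syntaxhighlight>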
[[File:Hyperparameter Optimization using Tree-Structured Parzen Estimators.svg|thumb|Methods such as Bayesian optimization explore the space of potential hyperparameter choices intelligently, deciding which combination to explore next on the basis of previous observations.]]

=== Bayesian optimization ===
{{main|{{WikidataLink|Q17002908}}}}
Bayesian optimization is a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization builds a probabilistic model of the function mapping hyperparameter values to the objective evaluated on a validation set. Based on the current model, it iteratively evaluates a promising hyperparameter configuration and then updates the model, aiming to gather observations that reveal as much information as possible about this function and, in particular, the location of its optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected to be close to the optimum). In practice, Bayesian optimization has been shown<ref name="hutter">{{Citation | last1 = Hutter | first1 = Frank | last2 = Hoos | first2 = Holger | last3 = Leyton-Brown | first3 = Kevin | chapter = Sequential Model-Based Optimization for General Algorithm Configuration | title = Learning and Intelligent Optimization | volume = 6683 | pages = 507–523 | year = 2011 | url = http://www.cs.ubc.ca/labs/beta/Projects/SMAC/papers/11-LION5-SMAC.pdf | doi = 10.1007/978-3-642-25566-3_40 | citeseerx = 10.1.1.307.8813 | series = Lecture Notes in Computer Science | isbn = 978-3-642-25565-6 | s2cid = 6944647 | accessdate = 2024-06-08 | archive-date = 2021-12-28 | archive-url = https://web.archive.org/web/20211228134050/http://www.cs.ubc.ca/labs/beta/Projects/SMAC/papers/11-LION5-SMAC.pdf | dead-url = no }}</ref><ref name="bergstra11">{{Citation | last1 = Bergstra | first1 = James | last2 = Bardenet | first2 = Remi | last3 = Bengio | first3 = Yoshua | last4 = Kegl | first4 = Balazs | title = Algorithms for hyper-parameter optimization | journal = Advances in Neural Information Processing Systems | year = 2011 | url = http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf | accessdate = 2024-06-08 | archive-date = 2020-08-10 | archive-url = https://web.archive.org/web/20200810160014/https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf | dead-url = no }}</ref><ref name="snoek">{{cite journal | last1 = Snoek | first1 = Jasper | last2 = Larochelle | first2 = Hugo | last3 = Adams | first3 = Ryan | title = Practical Bayesian Optimization of Machine Learning Algorithms | journal = Advances in Neural Information Processing Systems | volume = <!-- --> | pages = <!-- --> | year = 2012 | url = http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf | bibcode = 2012arXiv1206.2944S | arxiv = 1206.2944 | access-date = 2024-06-08 | archive-date = 2020-11-01 | archive-url = https://web.archive.org/web/20201101120738/https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf | dead-url = no }}</ref><ref name="thornton">{{cite journal | last1 = Thornton | first1 = Chris | last2 = Hutter | first2 = Frank | last3 = Hoos | first3 = Holger | last4 = Leyton-Brown | first4 = Kevin | title = Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms | journal = Knowledge Discovery and Data Mining | volume = <!-- --> | pages = <!-- --> | year = 2013 | url = http://www.cs.ubc.ca/labs/beta/Projects/autoweka/papers/autoweka.pdf | bibcode = 2012arXiv1208.3719T | arxiv = 1208.3719 | access-date = 2024-06-08 | archive-date = 2022-01-20 | archive-url = https://web.archive.org/web/20220120181554/http://www.cs.ubc.ca/labs/beta/Projects/autoweka/papers/autoweka.pdf | dead-url = no }}</ref> to obtain better results in fewer evaluations than grid search and random search, because it can reason about the quality of experiments before they are run.
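The loop just described can be sketched for a single hyperparameter. The example below is an illustrative assumption rather than a reference implementation: it uses a Gaussian-process surrogate from scikit-learn with a Matérn kernel and an expected-improvement acquisition function (common but not mandated choices), and <code>validation_loss</code> is a placeholder for training the model and scoring it on the validation set.

<syntaxhighlight lang="python">
# Sketch of Bayesian optimization: fit a probabilistic surrogate to the
# observations so far, then pick the next hyperparameter by maximizing
# expected improvement (balancing exploration and exploitation).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def validation_loss(x):
    # Placeholder black-box objective standing in for "train with
    # hyperparameter x and return the loss on the validation set".
    return np.sin(3.0 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(-2.0, 2.0, 400).reshape(-1, 1)  # assumed search interval
X = rng.uniform(-2.0, 2.0, size=(3, 1))                  # a few initial random trials
y = np.array([validation_loss(x[0]) for x in X])

for _ in range(20):
    surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    surrogate.fit(X, y)
    mu, sigma = surrogate.predict(candidates, return_std=True)
    best = y.min()
    # Expected improvement for minimization: large where the predicted mean is
    # low (exploitation) or the predictive uncertainty is high (exploration).
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (best - mu) / sigma
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        ei[sigma == 0.0] = 0.0
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, validation_loss(x_next[0]))

print("best hyperparameter:", X[np.argmin(y), 0], "with loss", y.min())
</syntaxhighlight>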
=== Gradient-based optimization ===
For specific learning algorithms, it is possible to compute the gradient with respect to the hyperparameters and then optimize them using [[梯度下降法|gradient descent]]. The first use of these techniques focused on neural networks.<ref>{{cite book|last1=Larsen|first1=Jan|last2=Hansen|first2=Lars Kai|last3=Svarer|first3=Claus|last4=Ohlsson|first4=M|title=Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop|chapter=Design and regularization of neural networks: The optimal use of a validation set|date=1996|pages=62–71|doi=10.1109/NNSP.1996.548336|isbn=0-7803-3550-3|citeseerx=10.1.1.415.3266|s2cid=238874|chapter-url=http://orbit.dtu.dk/files/4545571/Svarer.pdf|access-date=2024-06-08|archive-date=2019-08-05|archive-url=https://web.archive.org/web/20190805202535/https://orbit.dtu.dk/files/4545571/Svarer.pdf|dead-url=no}}</ref> They were later extended to other models, such as [[支持向量机|support vector machines]]<ref>{{cite journal |author1=Olivier Chapelle |author2=Vladimir Vapnik |author3=Olivier Bousquet |author4=Sayan Mukherjee |title=Choosing multiple parameters for support vector machines |journal=Machine Learning |year=2002 |volume=46 |pages=131–159 |url=http://www.chapelle.cc/olivier/pub/mlj02.pdf |doi=10.1023/a:1012450327387 |doi-access=free |access-date=2024-06-08 |archive-date=2016-12-24 |archive-url=https://web.archive.org/web/20161224194336/http://www.chapelle.cc/olivier/pub/mlj02.pdf |dead-url=no }}</ref> and logistic regression.<ref>{{cite journal|author1=Chuong B|author2=Chuan-Sheng Foo|author3=Andrew Y Ng|journal=Advances in Neural Information Processing Systems|volume=20|title=Efficient multiple hyperparameter learning for log-linear models|year=2008|url=http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf|access-date=2024-06-08|archive-date=2016-04-11|archive-url=https://web.archive.org/web/20160411084625/http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf|dead-url=no}}</ref>

A different approach to obtaining hyperparameter gradients is to differentiate the steps of an iterative optimization algorithm using [[自动微分|automatic differentiation]].<ref>{{cite journal|last1=Domke|first1=Justin|title=Generic Methods for Optimization-Based Modeling|journal=Aistats|date=2012|volume=22|url=http://www.jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|access-date=2017-12-09|archive-date=2014-01-24|archive-url=https://web.archive.org/web/20140124182520/http://jmlr.org/proceedings/papers/v22/domke12/domke12.pdf|url-status=dead}}</ref><ref name=abs1502.03492>{{cite arXiv |last1=Maclaurin|first1=Dougal|last2=Duvenaud|first2=David|last3=Adams|first3=Ryan P.|eprint=1502.03492|title=Gradient-based Hyperparameter Optimization through Reversible Learning|class=stat.ML|date=2015}}</ref><ref>{{cite journal |last1=Franceschi |first1=Luca |last2=Donini |first2=Michele |last3=Frasconi |first3=Paolo |last4=Pontil |first4=Massimiliano |title=Forward and Reverse Gradient-Based Hyperparameter Optimization |journal=Proceedings of the 34th International Conference on Machine Learning |date=2017 |arxiv=1703.01785 |bibcode=2017arXiv170301785F |url=http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf |access-date=2024-06-08 |archive-date=2024-02-29 |archive-url=https://web.archive.org/web/20240229075708/http://proceedings.mlr.press/v70/franceschi17a/franceschi17a-supp.pdf |dead-url=no }}</ref><ref>Shaban, A., Cheng, C. A., Hatch, N., & Boots, B. (2019, April). [https://arxiv.org/pdf/1810.10667.pdf Truncated back-propagation for bilevel optimization] {{Wayback|url=https://arxiv.org/pdf/1810.10667.pdf |date=20240324015327 }}. In ''The 22nd International Conference on Artificial Intelligence and Statistics'' (pp. 1723-1732). PMLR.</ref> Along this line, a more recent work uses the [[隐函数定理|implicit function theorem]] to compute hypergradients and proposes a stable approximation of the inverse Hessian, scaling to millions of hyperparameters while requiring constant memory.<ref>Lorraine, J., Vicol, P., & Duvenaud, D. (2018). [[arxiv:1911.02590|Optimizing Millions of Hyperparameters by Implicit Differentiation]]. ''arXiv preprint arXiv:1911.02590''.</ref>

Another approach<ref>Lorraine, J., & Duvenaud, D. (2018). [[arxiv:1802.09419|Stochastic hyperparameter optimization through hypernetworks]]. ''arXiv preprint arXiv:1802.09419''.</ref> is to train a hypernetwork to approximate the best-response function; this method can also handle discrete hyperparameters. Self-tuning networks<ref>MacKay, M., Vicol, P., Lorraine, J., Duvenaud, D., & Grosse, R. (2019). [[arxiv:1903.03088|Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions]]. ''arXiv preprint arXiv:1903.03088''.</ref> provide a more memory-efficient version of this approach by choosing a compact representation for the hypernetwork. More recently, Δ-STN<ref>Bae, J., & Grosse, R. B. (2020). [[arxiv:2010.13514|Delta-stn: Efficient bilevel optimization for neural networks using structured response jacobians]]. ''Advances in Neural Information Processing Systems'', ''33'', 21725-21737.</ref> has accelerated training through a slight reparameterization of the hypernetwork. Δ-STN also obtains a better approximation of the best-response Jacobian by linearizing the network in its weights, which removes the unnecessary nonlinear effects of large weight changes.

Apart from hypernetworks, gradient-based methods can also be used to optimize discrete hyperparameters, for example by adopting a continuous relaxation of the parameters.<ref>Liu, H., Simonyan, K., & Yang, Y. (2018). [[arxiv:1806.09055|Darts: Differentiable architecture search]]. ''arXiv preprint arXiv:1806.09055''.</ref> Such methods have been used extensively to optimize architecture hyperparameters in [[神经结构搜索|neural architecture search]].
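As a small illustration of the gradient-based idea, the sketch below computes a hypergradient with automatic differentiation. It is a simplified case under stated assumptions: the inner training problem is ridge regression with a closed-form solution (so no unrolling or implicit differentiation is needed), the data are synthetic, and JAX, the learning rate, and the step count are choices made here rather than anything prescribed by the cited works.

<syntaxhighlight lang="python">
# Hypergradient sketch: because the inner (training) problem below has a
# closed-form solution, the validation loss is an ordinary differentiable
# function of the regularization hyperparameter, and automatic differentiation
# gives its gradient directly for use in gradient descent.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
w_true = jax.random.normal(k1, (10,))
X_train = jax.random.normal(k2, (50, 10))
X_val = jax.random.normal(k3, (30, 10))
y_train = X_train @ w_true + 0.5 * jax.random.normal(k4, (50,))
y_val = X_val @ w_true

def val_loss(log_lam):
    lam = jnp.exp(log_lam)                         # optimize log(lambda) so lambda stays positive
    A = X_train.T @ X_train + lam * jnp.eye(10)
    w = jnp.linalg.solve(A, X_train.T @ y_train)   # inner solution w(lambda), ridge regression
    return jnp.mean((X_val @ w - y_val) ** 2)      # outer objective on the validation set

hypergrad = jax.grad(val_loss)
log_lam = jnp.array(0.0)
for _ in range(100):
    log_lam = log_lam - 0.1 * hypergrad(log_lam)   # gradient descent on the hyperparameter

print("tuned regularization strength:", float(jnp.exp(log_lam)))
</syntaxhighlight>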
=== Evolutionary optimization ===
{{main|进化算法}}
Evolutionary optimization is a method for the global optimization of noisy black-box functions.<ref name="bergstra11" /> In hyperparameter optimization, evolutionary optimization uses [[进化算法|evolutionary algorithms]] to search the hyperparameter space of a given algorithm, following a process inspired by [[演化|evolution]] (a minimal sketch of this loop is given at the end of this section):
# Create an initial population of random solutions (i.e., randomly generated hyperparameter tuples, typically more than 100 of them)
# Evaluate the tuples and compute their [[适值函数|fitness]] (e.g., the 10-fold [[交叉验证|cross-validation]] accuracy of the machine learning algorithm trained with those hyperparameters)
# Rank the hyperparameter tuples by their relative fitness
# Replace the worst-performing tuples with new ones generated through [[交叉 (遗传算法)|crossover]] and [[变异 (遗传算法)|mutation]]
# Repeat steps 2–4 until satisfactory performance is reached or performance no longer improves

Evolutionary optimization has been used for hyperparameter optimization of statistical machine learning algorithms,<ref name="bergstra11" /> [[自动机器学习|automated machine learning]], and typical neural networks,<ref name="kousiouris1">{{cite journal | vauthors = Kousiouris G, Cuccinotta T, Varvarigou T | year = 2011 | title = The effects of scheduling, workload type and consolidation scenarios on virtual machine performance and their prediction through optimized artificial neural networks | url = https://www.sciencedirect.com/science/article/abs/pii/S0164121211000951 | journal = Journal of Systems and Software | volume = 84 | issue = 8 | pages = 1270–1291 | doi = 10.1016/j.jss.2011.04.013 | hdl = 11382/361472 | hdl-access = free | access-date = 2024-06-08 | archive-date = 2024-04-20 | archive-url = https://web.archive.org/web/20240420012343/https://www.sciencedirect.com/science/article/abs/pii/S0164121211000951 | dead-url = no }}</ref> for architecture search of [[深度学习|deep neural networks]],<ref name="miikkulainen1">{{cite arXiv | vauthors = Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B, Shahrzad H, Navruzyan A, Duffy N, Hodjat B | year = 2017 | title = Evolving Deep Neural Networks |eprint=1703.00548| class = cs.NE }}</ref><ref name="jaderberg1">{{cite arXiv | vauthors = Jaderberg M, Dalibard V, Osindero S, Czarnecki WM, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, Fernando C, Kavukcuoglu K | year = 2017 | title = Population Based Training of Neural Networks |eprint=1711.09846| class = cs.LG }}</ref> and for training the weights of deep neural networks.<ref name="such1">{{cite arXiv | vauthors = Such FP, Madhavan V, Conti E, Lehman J, Stanley KO, Clune J | year = 2017 | title = Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning |eprint=1712.06567| class = cs.NE }}</ref>
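The sketch promised above follows the five steps directly. The fitness function, population size, mutation scale, and number of generations here are illustrative assumptions; in practice the fitness would be the cross-validated score of a model trained with the candidate hyperparameters.

<syntaxhighlight lang="python">
# Toy evolutionary hyperparameter search following steps 1-5 above: random
# initial population, evaluate and rank by fitness, replace the worst half
# with mutated crossovers of the best half, and repeat.
import random

def fitness(params):
    # Placeholder for e.g. 10-fold cross-validation accuracy of a model trained
    # with these hyperparameters; an analytic stand-in is maximized here.
    learning_rate, regularization = params
    return -(learning_rate - 0.1) ** 2 - (regularization - 0.01) ** 2

def random_tuple():
    return (random.uniform(0.0, 1.0), random.uniform(0.0, 0.1))

def crossover_and_mutate(a, b):
    child = tuple(random.choice(pair) for pair in zip(a, b))             # uniform crossover
    return tuple(max(0.0, g + random.gauss(0.0, 0.01)) for g in child)   # Gaussian mutation

population = [random_tuple() for _ in range(100)]             # step 1
for generation in range(50):
    ranked = sorted(population, key=fitness, reverse=True)    # steps 2-3
    survivors = ranked[: len(ranked) // 2]
    offspring = [crossover_and_mutate(random.choice(survivors), random.choice(survivors))
                 for _ in range(len(ranked) - len(survivors))]
    population = survivors + offspring                        # step 4, then repeat (step 5)

print("best hyperparameters found:", max(population, key=fitness))
</syntaxhighlight>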
=== Population-based ===
Population-based training (PBT) learns both hyperparameter values and network weights. Several learning processes run independently, each with different hyperparameters. As in evolutionary methods, poorly performing models are iteratively replaced with models that adopt modified hyperparameter values and weights derived from the better performers. This warm-starting of the replacement models is the primary distinction between PBT and other evolutionary methods. PBT thus allows the hyperparameters to evolve and removes the need for manual tuning. The process makes no assumptions about the model architecture, loss function, or training procedure.

PBT and its variants are adaptive methods: they update the hyperparameters while the models are being trained. Non-adaptive methods, by contrast, follow the suboptimal strategy of assigning a constant set of hyperparameters for the whole of training.<ref>{{cite arXiv|last1=Li|first1=Ang|last2=Spyra|first2=Ola|last3=Perel|first3=Sagi|last4=Dalibard|first4=Valentin|last5=Jaderberg|first5=Max|last6=Gu|first6=Chenjie|last7=Budden|first7=David|last8=Harley|first8=Tim|last9=Gupta|first9=Pramod|date=2019-02-05|title=A Generalized Framework for Population Based Training|eprint=1902.01894|class=cs.AI}}</ref>

=== Early stopping-based ===
[[File:Successive-halving-for-eight-arbitrary-hyperparameter-configurations.png|thumb|Successive halving applied to eight arbitrary hyperparameter configurations. The approach starts with eight models with different configurations and repeatedly applies successive halving until only one model remains.]]
A class of early stopping-based hyperparameter optimization algorithms is purpose-built for large hyperparameter search spaces, especially when the computational cost of evaluating one set of hyperparameters is high. Irace implements the iterated racing algorithm, which focuses the search around the most promising configurations and uses statistical tests to discard the poorly performing ones.<ref name="irace">{{cite journal |last1=López-Ibáñez |first1=Manuel |last2=Dubois-Lacoste |first2=Jérémie |last3=Pérez Cáceres |first3=Leslie |last4=Stützle |first4=Thomas |last5=Birattari |first5=Mauro |date=2016 |title=The irace package: Iterated Racing for Automatic Algorithm Configuration |journal=Operations Research Perspective |volume=3 |issue=3 |pages=43–58 |doi=10.1016/j.orp.2016.09.002|doi-access=free |hdl=10419/178265 |hdl-access=free }}</ref><ref name="race">{{cite journal |last1=Birattari |first1=Mauro |last2=Stützle |first2=Thomas |last3=Paquete |first3=Luis |last4=Varrentrapp |first4=Klaus |date=2002 |title=A Racing Algorithm for Configuring Metaheuristics |journal=Gecco 2002 |pages=11–18}}</ref>

Another early stopping hyperparameter optimization algorithm is successive halving (SHA),<ref>{{cite arXiv|last1=Jamieson|first1=Kevin|last2=Talwalkar|first2=Ameet|date=2015-02-27|title=Non-stochastic Best Arm Identification and Hyperparameter Optimization|eprint=1502.07943|class=cs.LG}}</ref> which begins as a random search but periodically prunes low-performing models, thereby concentrating computational resources on the more promising ones. Asynchronous successive halving (ASHA)<ref>{{cite arXiv|last1=Li|first1=Liam|last2=Jamieson|first2=Kevin|last3=Rostamizadeh|first3=Afshin|last4=Gonina|first4=Ekaterina|last5=Hardt|first5=Moritz|last6=Recht|first6=Benjamin|last7=Talwalkar|first7=Ameet|date=2020-03-16|title=A System for Massively Parallel Hyperparameter Tuning|class=cs.LG|eprint=1810.05934v5}}</ref> further improves SHA's resource utilization by removing the need to evaluate and prune models synchronously. Hyperband<ref>{{cite journal|last1=Li|first1=Lisha|last2=Jamieson|first2=Kevin|last3=DeSalvo|first3=Giulia|last4=Rostamizadeh|first4=Afshin|last5=Talwalkar|first5=Ameet|date=2020-03-16|title=Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization|journal=Journal of Machine Learning Research|volume=18|pages=1–52|arxiv=1603.06560}}</ref> is a higher-level early stopping-based algorithm that invokes SHA or ASHA multiple times with varying levels of pruning aggressiveness, making it more widely applicable while requiring fewer inputs.

=== Others ===
[[径向基函数|RBF]]<ref name=abs1705.08520>{{cite arXiv |eprint=1705.08520|last1=Diaz|first1=Gonzalo|title=An effective algorithm for hyperparameter optimization of neural networks|last2=Fokoue|first2=Achille|last3=Nannicini|first3=Giacomo|last4=Samulowitz|first4=Horst|class=cs.AI|year=2017}}</ref> and [[谱方法|spectral]]<ref name=abs1706.00764>{{cite arXiv |eprint=1706.00764|last1=Hazan|first1=Elad|title=Hyperparameter Optimization: A Spectral Approach|last2=Klivans|first2=Adam|last3=Yuan|first3=Yang|class=cs.LG|year=2017}}</ref> approaches have also been developed.

== Issues with hyperparameter optimization ==
When hyperparameter optimization is done, a set of hyperparameters is usually fitted on a training set and selected according to its generalization performance, or score, on a validation set. However, this procedure risks overfitting the hyperparameters to the validation set, so the generalization score measured on the validation set (which may be several sets in the case of cross-validation) cannot simultaneously be used to estimate the generalization performance of the final model. To do that, the generalization performance has to be evaluated on a set independent of (and disjoint from) the data used to optimize the hyperparameters; otherwise the reported performance may be overly optimistic. This can be done on a second test set, or through an outer loop of cross-validation, known as nested [[交叉验证|cross-validation]], which gives an unbiased estimate of the model's generalization performance while accounting for the bias introduced by hyperparameter optimization.

== See also ==
* [[自动机器学习|Automated machine learning]]
* [[神经结构搜索|Neural architecture search]]
* {{WikidataLink|Q6822261}}
* [[模型选择|Model selection]]
* [[自整定|Self-tuning]]
* [[XGBoost]]

== References ==
{{Reflist|30em}}

{{Differentiable computing}}

[[Category:机器学习]]
[[Category:数学最佳化]]
[[Category:模型选择]]