
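[[File:Hinge_loss_vs_zero_one_loss.svg|thumb|Plot of the hinge loss (blue, measured vertically) against the zero-one loss (measured vertically; green marks misclassification, {{Math|''y'' < 0}}) for {{Math|''t'' {{=}} 1}} and variable {{Mvar|y}} (measured horizontally). Note that the hinge loss also penalizes predictions with {{Math|''y'' < 1}}, which corresponds to the notion of a margin in a support vector machine.]]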
In [[機器學習|machine learning]], the '''hinge loss''' is a [[損失函數|loss function]] used for training classifiers. The hinge loss is used for "maximum-margin" classification, which makes it a natural fit for [[支持向量機|support vector machines]] (SVMs).<ref>{{Cite journal|title=Are Loss Functions All the Same?|url=http://web.mit.edu/lrosasco/www/publications/loss.pdf|last=Rosasco|first=L.|last2=De Vito|first2=E. D.|journal=Neural Computation|issue=5|doi=10.1162/089976604773135104|year=2004|volume=16|pages=1063–1076|pmc=|pmid=15070510|last3=Caponnetto|first3=A.|last4=Piana|first4=M.|last5=Verri|first5=A.|access-date=2019-06-04|archive-date=2020-01-11|archive-url=https://web.archive.org/web/20200111013717/http://web.mit.edu/lrosasco/www/publications/loss.pdf|dead-url=no}}</ref>
For an intended output <math>t={\pm}1</math> and a classifier score <math>y</math>, the hinge loss of the prediction <math>y</math> is defined as

:<math>\ell(y) = \max(0, 1-t \cdot y)</math>

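Note that <math>y</math> in the formula above should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in a linear SVM, <math>y = \mathbf{w} \cdot \mathbf{x} + b</math>, where <math>(\mathbf{w},b)</math> are the parameters of the [[超平面|hyperplane]] and <math>\mathbf{x}</math> is the input data point.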

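When <math>t</math> and <math>y</math> have the same sign (meaning that the classifier output <math>y</math> indicates the correct class) and <math>|y| \ge 1</math>, the hinge loss is <math>\ell(y) = 0</math>. When they have opposite signs (meaning that <math>y</math> indicates the wrong class), <math>\ell(y)</math> grows linearly with <math>|y|</math>. By the same reasoning, if <math>|y| < 1</math> there is still a loss even when <math>t</math> and <math>y</math> have the same sign: the classification is correct, but the margin is insufficient.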
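
The three cases above can be checked numerically. The following is a minimal Python sketch of the definition, not a reference implementation; the function name <code>hinge_loss</code> and all numeric values are made up for illustration.

```python
def hinge_loss(t, y):
    """Hinge loss max(0, 1 - t*y) for a label t in {-1, +1} and a raw score y."""
    if t not in (-1, 1):
        raise ValueError("t must be -1 or +1")
    return max(0.0, 1.0 - t * y)


# Raw score of a linear classifier, y = w . x + b (illustrative values only).
w, b = [2.0, -1.0], 0.5
x = [1.0, 0.5]
y = sum(wi * xi for wi, xi in zip(w, x)) + b  # 2.0 - 0.5 + 0.5 = 2.0

print(hinge_loss(+1, y))     # same sign, |y| >= 1: 0.0
print(hinge_loss(+1, 0.25))  # same sign, but inside the margin: 0.75
print(hinge_loss(-1, y))     # opposite sign: 3.0
```

Passing the predicted label (for example <code>sign(y)</code>) instead of the raw score would hide the margin information that the loss is designed to penalize.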

== Extensions ==
Binary SVMs are commonly extended to [[多元分类|multiclass classification]] with a one-vs-all (winner-takes-all strategy, WTA SVM) or a one-vs-one (max-wins voting, MWV SVM) scheme.<ref name="duan2005">{{Cite book|last=Duan|first=K. B.|last2=Keerthi|first2=S. S.|title=Multiple Classifier Systems|doi=10.1007/11494683_28|series=[[Lecture Notes in Computer Science|LNCS]]|volume=3541|pages=278–285|year=2005|isbn=978-3-540-26306-7|chapterurl=http://www.keerthis.com/multiclass_mcs_kaibo_05.pdf|pmid=|pmc=|publisher=|location=|chapter=Which Is the Best Multiclass SVM Method? An Empirical Study|access-date=2019-06-04|archive-date=2017-08-08|archive-url=https://web.archive.org/web/20170808220147/http://keerthis.com/multiclass_mcs_kaibo_05.pdf|dead-url=no}}</ref>
The hinge loss itself can be extended in a similar way, and several different multiclass variants of it have been proposed.<ref name="unifiedview">{{Cite journal|title=A Unified View on Multi-class Support Vector Classification|url=http://www.jmlr.org/papers/volume17/11-229/11-229.pdf|last=Doğan|first=Ürün|last2=Glasmachers|first2=Tobias|journal=[[Journal of Machine Learning Research]]|year=2016|volume=17|pages=1–32|last3=Igel|first3=Christian|access-date=2019-06-04|archive-date=2018-05-05|archive-url=https://web.archive.org/web/20180505030958/http://www.jmlr.org/papers/volume17/11-229/11-229.pdf|dead-url=no}}</ref> For example, Crammer and Singer<ref>{{Cite journal|title=On the algorithmic implementation of multiclass kernel-based vector machines|url=http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf|last=Crammer|first=Koby|last2=Singer|first2=Yoram|journal=[[Journal of Machine Learning Research]]|year=2001|volume=2|pages=265–292|access-date=2019-06-04|archive-date=2015-08-29|archive-url=https://web.archive.org/web/20150829102651/http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf|dead-url=no}}</ref>
defined the hinge loss of a multiclass linear classifier as<ref>{{cite conference |first1=Robert C. |last1=Moore |first2=John |last2=DeNero |title=L<sub>1</sub> and L<sub>2</sub> regularization for multiclass hinge loss models |url=http://www.ttic.edu/sigml/symposium2011/papers/Moore+DeNero_Regularization.pdf |booktitle=Proc. Symp. on Machine Learning in Speech and Language Processing |year=2011 |access-date=2019-06-04 |archive-date=2017-08-28 |archive-url=https://web.archive.org/web/20170828233715/http://www.ttic.edu/sigml/symposium2011/papers/Moore+DeNero_Regularization.pdf |dead-url=no }}</ref>

: <math>\ell(y) = \max(0, 1 + \max_{y \ne t} \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x})</math>

where <math>t</math> is the target label, and <math>\mathbf{w}_t</math> and <math>\mathbf{w}_y</math> are the model parameters.

Weston and Watkins proposed a similar definition, but with a [[求和|sum]] in place of the maximum:<ref>{{cite conference |first1=Jason |last1=Weston |first2=Chris |last2=Watkins |title=Support Vector Machines for Multi-Class Pattern Recognition |url=https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf |booktitle=European Symposium on Artificial Neural Networks |year=1999 |access-date=2019-06-04 |archive-date=2018-05-05 |archive-url=https://web.archive.org/web/20180505024710/https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf |dead-url=no }}</ref><ref name="unifiedview" />

: <math>\ell(y) = \sum_{y \ne t} \max(0, 1 + \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x})</math>
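
The difference between the maximum-based and the sum-based multiclass definitions can be seen on a small example. The Python sketch below assumes that the model is stored as a list <code>W</code> of per-class weight vectors and that classes are indexed from 0; the function names and the numbers are illustrative and do not come from the cited papers.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def crammer_singer_loss(W, x, t):
    """max(0, 1 + max_{y != t} w_y.x - w_t.x): only the strongest rival class counts."""
    scores = [dot(w, x) for w in W]
    best_other = max(s for k, s in enumerate(scores) if k != t)
    return max(0.0, 1.0 + best_other - scores[t])


def weston_watkins_loss(W, x, t):
    """sum_{y != t} max(0, 1 + w_y.x - w_t.x): every margin-violating class counts."""
    scores = [dot(w, x) for w in W]
    return sum(max(0.0, 1.0 + s - scores[t])
               for k, s in enumerate(scores) if k != t)


# Three classes, two features (illustrative values); class scores are 2.0, 1.0, 1.5.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = [2.0, 1.0]

print(crammer_singer_loss(W, x, 0), weston_watkins_loss(W, x, 0))  # 0.5 0.5
print(crammer_singer_loss(W, x, 1), weston_watkins_loss(W, x, 1))  # 2.0 3.5
```

With target class 0 only one rival violates the margin, so the two losses agree; with target class 1 both rivals violate it, and the sum-based loss is larger.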

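In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling can use the following variant of the hinge loss, where {{Math|'''w'''}} denotes the SVM's parameters, {{Math|'''y'''}} the SVM's prediction, {{Mvar|φ}} the joint feature function, and {{Math|Δ}} the [[汉明距离|Hamming loss]]: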

: <math>\begin{align}
\ell(\mathbf{y}) & = \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\
         & = \max(0, \max_{y \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle)
\end{align}</math>
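
For a small output space, the maximization in the second line can be carried out by brute-force enumeration. The Python sketch below does this for binary label sequences. The joint feature map <code>phi</code> (an emission-style sum plus a count of equal neighbouring labels) is a hypothetical choice made only for this example, as are the weights and inputs; practical structured SVMs replace the enumeration with a problem-specific loss-augmented inference routine.

```python
from itertools import product


def hamming(y, t):
    """Hamming loss: number of positions where y and t differ."""
    return sum(a != b for a, b in zip(y, t))


def phi(x, y):
    """Hypothetical joint feature map (an assumption made for this sketch)."""
    emission = sum(xi for xi, yi in zip(x, y) if yi == 1)
    agreement = sum(1 for a, b in zip(y, y[1:]) if a == b)
    return [float(emission), float(agreement)]


def score(w, x, y):
    """Inner product <w, phi(x, y)>."""
    return sum(wi * fi for wi, fi in zip(w, phi(x, y)))


def structured_hinge_loss(w, x, t, labels=(0, 1)):
    """Margin-rescaled structured hinge loss by exhaustive search over outputs."""
    candidates = product(labels, repeat=len(t))
    augmented = max(hamming(y, t) + score(w, x, y) for y in candidates)
    return max(0.0, augmented - score(w, x, t))


x = [2.0, -1.0, 0.5]
t = (1, 0, 1)
print(structured_hinge_loss([1.0, 0.5], x, t))   # 1.0
print(structured_hinge_loss([10.0, 0.0], x, t))  # 0.0
```

In the second call the target output beats every other candidate by at least that candidate's Hamming loss, so the loss vanishes.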

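== Optimization ==
The hinge loss is a [[凸函数|convex function]], so many of the convex optimizers commonly used in machine learning can be applied to it. It is not [[可微函数|differentiable]], but it has a [[次导数|subgradient]] with respect to the parameters {{Math|'''w'''}} of a linear SVM, given by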

: <math>\frac{\partial\ell}{\partial w_i} = \begin{cases}
 -t \cdot x_i & \text{if } t \cdot y < 1 \\
 0      & \text{otherwise}
\end{cases}</math>

where the [[评分函数|score function]] is <math>y = \mathbf{w} \cdot \mathbf{x}</math>.
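
The subgradient above leads directly to a subgradient-descent update. The following Python sketch shows a single step for one training example. It assumes a score function without a bias term, as in the formula, and omits the regularization term that a full SVM objective would include; the function names and the learning rate are illustrative.

```python
def hinge_subgradient(w, x, t):
    """A subgradient of max(0, 1 - t * (w . x)) with respect to w."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    if t * y < 1:
        return [-t * xi for xi in x]
    return [0.0] * len(w)


def subgradient_step(w, x, t, lr=0.1):
    """Move w against the subgradient with step size lr."""
    g = hinge_subgradient(w, x, t)
    return [wi - lr * gi for wi, gi in zip(w, g)]


w = [0.0, 0.0]
x, t = [1.0, 2.0], 1

print(hinge_subgradient(w, x, t))  # [-1.0, -2.0], since t*y = 0 < 1
w = subgradient_step(w, x, t, lr=0.5)
print(w)                           # [0.5, 1.0]
print(hinge_subgradient(w, x, t))  # [0.0, 0.0], since t*y = 2.5 >= 1
```

Once the example is classified with a margin of at least 1, the subgradient is zero and the example no longer changes the weights.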
[[File:Hinge_loss_variants.svg|thumb|Plot of three variants of the hinge loss as a function of {{Math|''z'' {{=}} ''ty''}}: the "ordinary" variant (blue), its square (green), and the piecewise-smooth version by Rennie and Srebro (red).]]
However, since the hinge loss is not differentiable at <math>ty = 1</math>, [[光滑函数|smoothed]] variants may be preferred for optimization,<ref name="zhang">{{cite conference |last=Zhang |first=Tong |title=Solving large scale linear prediction problems using stochastic gradient descent algorithms |conference=ICML |year=2004 |url=http://tongzhang-ml.org/papers/icml04-stograd.pdf |access-date=2019-06-04 |archive-date=2019-06-04 |archive-url=https://web.archive.org/web/20190604102310/http://tongzhang-ml.org/papers/icml04-stograd.pdf |dead-url=no }}</ref> such as the piecewise-smooth version by Rennie and Srebro<ref>{{cite conference |title=Loss Functions for Preference Levels: Regression with Discrete Ordered Labels |first1=Jason D. M. |last1=Rennie |first2=Nathan |last2=Srebro |conference=Proc. [[IJCAI]] Multidisciplinary Workshop on Advances in Preference Handling |year=2005 |url=http://ttic.uchicago.edu/~nati/Publications/RennieSrebroIJCAI05.pdf |access-date=2019-06-04 |archive-date=2015-11-06 |archive-url=https://web.archive.org/web/20151106010902/http://ttic.uchicago.edu/~nati/Publications/RennieSrebroIJCAI05.pdf |dead-url=no }}</ref>

: <math>\ell(y) = \begin{cases}
\frac{1}{2} - ty    & \text{if} ~~ ty \le 0, \\
\frac{1}{2} (1 - ty)^2 & \text{if} ~~ 0 < ty < 1, \\
0           & \text{if} ~~ 1 \le ty
\end{cases}</math>
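
The piecewise definition above translates into a short function. The Python sketch below evaluates the loss and its derivative with respect to <math>z = ty</math>; the function names are illustrative. The values around <math>z = 0</math> and <math>z = 1</math> show that neighbouring pieces join without a kink.

```python
def smooth_hinge(t, y):
    """Piecewise-smooth hinge loss as a function of z = t*y."""
    z = t * y
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1 - z) ** 2
    return 0.0


def smooth_hinge_dz(t, y):
    """Derivative of the loss with respect to z = t*y."""
    z = t * y
    if z <= 0:
        return -1.0
    if z < 1:
        return -(1 - z)
    return 0.0


for y in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(y, smooth_hinge(1, y), smooth_hinge_dz(1, y))
```

For <math>t = 1</math> the loss takes the values 1.5, 0.5, 0.125, 0 and 0 at these points, and the derivative goes from −1 to 0 continuously, unlike the derivative of the ordinary hinge loss, which jumps at <math>ty = 1</math>.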

or the quadratically smoothed variant suggested by Zhang:<ref name="zhang" />

: <math>\ell_\gamma(y) = \begin{cases}
\frac{1}{2\gamma} \max(0, 1 - ty)^2 & \text{if} ~~ ty \ge 1 - \gamma \\
1 - \frac{\gamma}{2} - ty      & \text{otherwise}
\end{cases}</math>

The modified Huber loss <math>L</math> is the special case of this loss function with <math>\gamma = 2</math>; specifically, <math>L(t,y) = 4 \ell_2(y)</math>.
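
The relation to the modified Huber loss can be verified numerically. The Python sketch below implements <math>\ell_\gamma</math> as defined above, together with the modified Huber loss in its usual piecewise form (<math>\max(0, 1 - ty)^2</math> for <math>ty \ge -1</math> and <math>-4ty</math> otherwise). That piecewise form is not spelled out in this article and is assumed here; the function names are illustrative.

```python
import math


def quad_smoothed_hinge(t, y, gamma):
    """Quadratically smoothed hinge loss with smoothing parameter gamma > 0."""
    z = t * y
    if z >= 1 - gamma:
        return max(0.0, 1 - z) ** 2 / (2 * gamma)
    return 1 - gamma / 2 - z


def modified_huber(t, y):
    """Modified Huber loss in its usual piecewise form (assumed definition)."""
    z = t * y
    if z >= -1:
        return max(0.0, 1 - z) ** 2
    return -4 * z


# Compare L(t, y) with 4 * l_2(y) on both sides of the breakpoint ty = -1.
for y in (-3.0, -1.0, 0.0, 0.5, 1.0, 2.0):
    lhs = modified_huber(1, y)
    rhs = 4 * quad_smoothed_hinge(1, y, gamma=2)
    print(y, lhs, rhs, math.isclose(lhs, rhs))
```

Each row prints matching values, for example 12 at <math>ty = -3</math> and 0.25 at <math>ty = 0.5</math>.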

== References ==
{{Reflist}}
[[Category:机器学习]]
[[Category:支持向量机]]
[[Category:损失函数]]
[[Category:有未审阅翻译的页面]]