
{{other uses|Dice}}
The '''Dice coefficient''', also known as the Sørensen–Dice coefficient, is named after {{le|托瓦爾·索倫森|Thorvald Sørensen|Thorvald Sørensen}} and {{le|李·雷蒙德·戴斯|Lee Raymond Dice|Lee Raymond Dice}}<ref>{{cite journal |last=Dice |first=Lee R.
|title=Measures of the Amount of Ecologic Association Between Species |jstor=1932409
|journal=Ecology |volume=26 |issue=3 |year=1945 |pages=297–302 |doi=10.2307/1932409 }}</ref>. It is a similarity measure on sets, commonly used to compute the similarity of two samples:

:<math>s = \frac{2 | X \cap Y |}{| X | + | Y |} </math>

It is formally similar to the [[Jaccard指数|Jaccard index]], but it has some different properties.
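
As an illustration, the set formula above can be sketched in Python. The function names <code>dice</code> and <code>jaccard</code> are hypothetical, and returning 1.0 for two empty sets is an assumption, since the formula is undefined (0/0) in that case.

```python
def dice(x: set, y: set) -> float:
    """Sørensen–Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return 2 * len(x & y) / total


def jaccard(x: set, y: set) -> float:
    """Jaccard index |X ∩ Y| / |X ∪ Y|, shown only for comparison."""
    union = x | y
    if not union:
        # Same assumption as in dice().
        return 1.0
    return len(x & y) / len(union)


x = {"a", "b", "c"}
y = {"b", "c", "d", "e"}
print(dice(x, y))     # 2*2 / (3+4) = 4/7
print(jaccard(x, y))  # 2 / 5
```

Both values lie between 0 and 1, but the Dice coefficient counts the overlap twice against the summed set sizes, so for partially overlapping sets it is larger than the Jaccard index.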

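Like the Jaccard index, it ranges from 0 to 1. Unlike the Jaccard index, the corresponding difference function

:<math>d = 1 -  \frac{2 | X \cap Y |}{| X | + | Y |} </math>

is not a proper distance metric, because it does not satisfy the triangle inequality. For example, given {a}, {b}, and {a,b}, the distance between the first two sets is 1, while the distance between the third set and each of the other two is one third. The direct distance 1 therefore exceeds 1/3 + 1/3 = 2/3, the length of the two-step path through {a,b}.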
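
The counterexample above can be checked numerically. This is a minimal sketch; the helper name <code>dice_difference</code> is hypothetical, and it assumes that at least one of the two sets is non-empty.

```python
def dice_difference(x: set, y: set) -> float:
    """d = 1 - 2|X ∩ Y| / (|X| + |Y|); assumes x and y are not both empty."""
    return 1 - 2 * len(x & y) / (len(x) + len(y))


a, b, ab = {"a"}, {"b"}, {"a", "b"}

direct = dice_difference(a, b)                            # {a} -> {b}
detour = dice_difference(a, ab) + dice_difference(ab, b)  # {a} -> {a,b} -> {b}

print(direct, detour)
print(direct <= detour)  # False: the triangle inequality is violated
```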

As with the Jaccard index, the set operations can be expressed as operations on two vectors ''A'' and ''B'':

:<math>s_v = \frac{2 | A \cdot B |}{| A |^2 + | B |^2} </math>

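This formula gives the same result as the set formula when ''A'' and ''B'' are binary vectors, and it also provides a more general similarity measure between vectors.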
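
A minimal sketch of the vector form in plain Python follows. The function name <code>dice_vector</code> is hypothetical, and raising an error for two zero vectors is an assumption, because the formula is undefined there.

```python
def dice_vector(a, b) -> float:
    """Vector form: 2|A · B| / (|A|^2 + |B|^2)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(a, b))
    norm_sq = sum(p * p for p in a) + sum(q * q for q in b)
    if norm_sq == 0:
        # Assumption: reject two zero vectors instead of returning a value.
        raise ValueError("undefined for two zero vectors")
    return 2 * abs(dot) / norm_sq


# Indicator vectors of {a, b, c} and {b, c, d, e} over the universe a..e.
A = [1, 1, 1, 0, 0]
B = [0, 1, 1, 1, 1]
print(dice_vector(A, B))  # 2*2 / (3+4) = 4/7, the same as the set formula
```
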
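The Dice coefficient can also be used to compute the similarity of two strings: Dice(s1,s2) = 2*comm(s1,s2)/(leng(s1)+leng(s2)),
where comm(s1,s2) is the number of characters that s1 and s2 have in common, and leng(s1) and leng(s2) are the lengths of the strings s1 and s2.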
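
The character-based formula can be sketched as follows. The text does not say how repeated characters are counted in comm(s1,s2); this sketch assumes a multiset intersection, so a character that occurs twice in both strings counts twice. The function name <code>dice_chars</code> and the value returned for two empty strings are also assumptions.

```python
from collections import Counter


def dice_chars(s1: str, s2: str) -> float:
    """Dice(s1, s2) = 2 * comm(s1, s2) / (leng(s1) + leng(s2))."""
    total = len(s1) + len(s2)
    if total == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    # Assumption: comm() counts shared characters with multiplicity.
    common = sum((Counter(s1) & Counter(s2)).values())
    return 2 * common / total


# "night" and "nacht" share the characters n, h and t.
print(dice_chars("night", "nacht"))  # 2*3 / (5+5)
```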

In [[信息检索|information retrieval]], given two keyword sets ''X'' and ''Y'', the similarity is defined as twice the shared information (the overlap) divided by the sum of the cardinalities, which is the set formula given above.<ref>{{cite book |last=van Rijsbergen |first=Cornelis Joost |year=1979 |title=Information Retrieval |url=http://www.dcs.gla.ac.uk/Keith/Preface.html |publisher=Butterworths |location=London |isbn=3-642-12274-4 |access-date=2012-05-26 |archive-date=2005-04-06 |archive-url=https://web.archive.org/web/20050406090119/http://www.dcs.gla.ac.uk/Keith/Preface.html |dead-url=no }}</ref>

When used as a similarity measure between strings, the coefficient for two strings ''x'' and ''y'' can be computed from their [[bigram]]s as follows:<ref>{{cite conference |last=Kondrak |first=Grzegorz |coauthors=Marcu, Daniel; and Knight, Kevin |year=2003 |title=Cognates Can Improve Statistical Translation Models |booktitle=Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics |pages=46–48 |url=http://aclweb.org/anthology/N/N03/N03-2016.pdf |access-date=2012-05-26 |archive-date=2016-03-04 |archive-url=https://web.archive.org/web/20160304035046/http://aclweb.org/anthology/N/N03/N03-2016.pdf |dead-url=no }}</ref>

:<math>s = \frac{2 n_t}{n_x + n_y}</math>

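where ''n''<sub>''t''</sub> is the number of bigrams common to both strings, ''n''<sub>''x''</sub> is the number of bigrams in ''x'', and ''n''<sub>''y''</sub> is the number of bigrams in ''y''. For example, to compute the similarity between the following two strings: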

:<code>night</code>
:<code>nacht</code>

we first obtain the set of bigrams in each word:
:{<code>ni</code>,<code>ig</code>,<code>gh</code>,<code>ht</code>}
:{<code>na</code>,<code>ac</code>,<code>ch</code>,<code>ht</code>}

Each set has four elements, and the two sets share only one element: <code>ht</code>.

Substituting into the formula gives ''s''&nbsp;=&nbsp;(2&nbsp;·&nbsp;1)&nbsp;/&nbsp;(4&nbsp;+&nbsp;4)&nbsp;=&nbsp;0.25.
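
The worked example above can be reproduced with a short Python sketch. The names <code>bigrams</code> and <code>dice_bigrams</code> are hypothetical. The sketch assumes that bigrams are collected into sets, so a bigram repeated within one string counts once, and that two strings without any bigrams have similarity 0.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_bigrams(x: str, y: str) -> float:
    """s = 2 * n_t / (n_x + n_y), computed over bigram sets."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Assumption: strings shorter than two characters yield 0.
        return 0.0
    return 2 * len(bx & by) / total


print(sorted(bigrams("night")))        # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))        # ['ac', 'ch', 'ht', 'na']
print(dice_bigrams("night", "nacht"))  # 0.25
```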

==See also==
*[[雅卡爾指數|Jaccard index]], which is related to the Dice coefficient by <math>D=2J/(1+J)</math> and <math>J=D/(2-D)</math>
*{{le|Tversky index}}
*[[萊文斯坦距離|Levenshtein distance]]
*[[Sørensen similarity index]]

==References==
{{reflist}}


[[Category:信息检索]]
[[Category:字符串相似性度量]]
[[Category:测度论]]