查看“︁最小哈希”︁的源代码

{{TA
|G1=IT
|G2=Math
}}
在[[计算机科学]]领域，'''最小哈希'''（或'''最小哈希式独立排列'''{{tsl|en|locality sensitive hashing|局部性敏感哈希}}）方法是一种快速判断两个集合是否相似的技术。这种方法是由{{harvs|first=Andrei|last=Broder|authorlink=Andrei Broder|year=1997|txt}},<ref name="b97"/>发明的，最初在[[AltaVista]]搜索引擎中用于在搜索结果中检测并消除重复Web页面。<ref name="bcfm">{{citation
 | last1 = Broder | first1 = Andrei Z. | author1-link = Andrei Broder
 | last2 = Charikar | first2 = Moses
 | last3 = Frieze | first3 = Alan M. | author3-link = Alan M. Frieze
 | last4 = Mitzenmacher | first4 = Michael | author4-link = Michael Mitzenmacher
 | contribution = Min-wise independent permutations
 | doi = 10.1145/276698.276781
 | location = New York, NY, USA
 | pages = 327–336
 | publisher = [[计算机协会]]
 | title = {{tsl|en|Symposium on Theory of Computing||Proc. 30th ACM Symposium on Theory of Computing (STOC '98)}}
 | year = 1998}}.</ref>
 
它同样也应用于大规模[[聚类]]问题，比如通过文档间包含的词语相似性进行聚类。<ref name="b97">{{citation
 | last = Broder | first = Andrei Z. | author-link = Andrei Broder
 | contribution = On the resemblance and containment of documents
 | doi = 10.1109/SEQUEN.1997.666900
 | pages = 21–29
 | publisher = [[电气电子工程师学会]]
 | title = Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997
 | year = 1997}}.</ref>

==雅可比相似度与最小哈希值==
[[雅卡尔指数|雅可比相似度系数]]通常用来表示两个集合的相似度，定义{{math|''U''}} 是一个集合，{{math|''A''}} 和 {{math|''B''}} 是 {{math|''U''}}的[[子集]]。雅可比相似度定义如下：<ref>{{citation|first=Paul|last=Jaccard|authorlink=Paul Jaccard|year=1901|title=étude comparative de la distribution florale dans une portion des Alpes et des Jura|journal=Bulletin de la Société Vaudoise des Sciences Naturelles|volume=37|pages=547–579}}.</ref>
:<math> J(A,B) = {{|A \cap B|}\over{|A \cup B|}}.</math>
它是一个0到1之间的数值上，当其为0时表示两个集合不相交，当其为1时表示两个集合相等，其他的情况则在0和1之间。它广泛地用于两集合间相似性的判断：当雅可比系数趋向于1时，两个集合更相似；反之，当雅可比系数趋向于0时，两个集合更不相似。

定义''h''是一个将{{math|''U''}}中的元素映射到一些不相交整数的[[哈希函数]]，{{math|''perm''}} 是集合 {{math|''U''}}中元素的排列[[排列]]，对于任意集合''S''，定义{{math|''h''<sub>min</sub>(''S'')}}为S集合中具有最小h(x)函数值的元素x，对应{{math|''h'' ∘ ''perm''}}（集合{{math|''S''}}的元素{{math|''x''}}是{{math|''h''(''perm''(''x''))}}的最小值）。将{{math|''h''<sub>min</sub>}}应用于 集合{{math|''A''}} 和 {{math|''B''}}，假定没有发生哈希[[碰撞 (计算机科学)|碰撞]]。有{{math|1=''h''<sub>min</sub>(''A'') = ''h''<sub>min</sub>(''B'')}}，当且仅当最小哈希值的并集{{math|''A'' &cup; ''B''}}依赖于交集{{math|''A'' &cap; ''B''}}时。
因此，
:{{math|1=Pr[''h''<sub>min</sub>(''A'') = ''h''<sub>min</sub>(''B'')] = ''J''(''A'',''B'').}}
也就是说，{{math|1=''h''<sub>min</sub>(''A'') = ''h''<sub>min</sub>(''B'')}}是真的[[概率]]等于{{math|''J''(''A'',''B'')}}
另一方面来说，如果{{math|''r''}}是一个当{{math|1=''h''<sub>min</sub>(''A'') = ''h''<sub>min</sub>(''B'')}}时值为1，其它情况下值为0的随机变量，那么{{math|''r''}}可认为是{{math|''J''(''A'',''B'')}}的[[估计量的偏差|无偏估计]]。尽管{{math|''r''}}的[[方差]]很高时，不能很好的估计雅可比相似度，因为<math>r</math>总是0或1。最小哈希思想通过以相同方式构造的几个变量，将其平均在一起来减少这种方差

==算法==
===多哈希函数的变种===
最简单的最小哈希方法是使用{{math|''k''}}个不同的哈希函数，其中{{math|''k''}}是固定的整数参数，使用这{{math|''k''}}个函数所对应的{{math|''k''}}个{{math|''h''<sub>min</sub>(''S'')}}值来描述每个集合{{math|''S''}}。
使用这种最简单的版本来判断{{math|''J''(''A'',''B'')}}，假定y是使得{{math|1=''h''<sub>min</sub>(''A'') = ''h''<sub>min</sub>(''B'')}}的哈希函数个数，使用{{math|''y''/''k''}}作为估计。则此估计是{{math|''k''}}个不同的0-1随机变量的平均值，其中每个随机变量当{{math|1=''h''<sub>min</sub>(''A'') = ''h''<sub>min</sub>(''B'')}}值为1，反之为0，并且是{{math|''J''(''A'',''B'')}}的无偏估计。因此，该平均值同样也是一个无偏估计，而且通过0-1随机变量之和的标准{{tsl|en|Chernoff bound|Chernoff上界}}可得知，其期望误差是{{math|O(1/&radic;''k'')}}。所以，针对任意给定的常数{{math|&epsilon; > 0}}，存在另一常数{{math|1=''k'' = O(1/&epsilon;<sup>2</sup>)}}，其估计的期望误差不超过{{math|&epsilon;}}。例如，使用400个哈希函数值来估计{{math|''J''(''A'',''B'')}}，其期望误差将小于或等于.05。

===单一哈希函数的变种===
计算多个哈希函数的代价是相当昂贵的，因此有关最小哈希方法的另一种实现方法是仅使用单一的哈希函数来避免这个问题。对于每个集合，使用这个单一的哈希函数选出其中的多个值，而不是每个哈希函数选择一个值。假定{{math|''h''}}是一个哈希函数，{{math|''k''}}是一个固定整数。如果{{math|''S''}}是{{math|''h''}}域上{{math|''k''}}或更多元素的集合，则定义{{math|''h''<sub>(''k'')</sub>(''S'')}}为{{math|''S''}}中具有最小{{math|''h''}}值的{{math|''k''}}个元素所组成的子集。该子集{{math|''h''<sub>(''k'')</sub>(''S'')}}可用作集合{{math|''S''}}的一个''签名''，任意两个集合间的相似度可通过比较它们的签名来计算。

特别地，假定''A'' and ''B''为任意两个集合，{{math|1=''X'' = ''h''<sub>(''k'')</sub>(''h''<sub>(''k'')</sub>(''A'') &cup; ''h''<sub>(''k'')</sub>(''B'')) = ''h''<sub>(''k'')</sub>(''A'' &cup; ''B'')}}是{{math|''A'' &cup; ''B''}}的''k''个元素的集合，如果''h''是随机变量并且''k''个元素的任意子集等可能地被选择。也就是说，{{math|''X''}}是{{math|''A'' &cup; ''B''}}的{{tsl|en|simple random sample|简单随机样本}}。{{math|1=''Y'' = ''X'' &cap; ''h''<sub>(''k'')</sub>(''A'') &cap; ''h''<sub>(''k'')</sub>(''B'')}}是集合{{math|''X''}}中属于{{math|''A'' &cap; ''B''}}交集的元素。因此，|{{math|''Y''}}|/{{math|''k''}}是{{math|''J''(''A'',''B'')}}的无偏估计。单一哈希函数的估计与多个哈希函数产生的估计的不同在于{{math|''X''}}总是有{{math|''k''}}个元素，而多个哈希函数由于两个不同的哈希函数可能会产生相同的最小值，因此可能会产生更少的样本元素。然而，当{{math|''k''}}相对集合大小来说很小时，该区别可忽略不计。

通过不重复取样的标准{{tsl|en|Chernoff bound|Chernoff上界}}，该估计的期望误差为{{math|O(1/&radic;''k'')}}，其性能与多个哈希函数方法相匹配。

===耗时分析===
|{{math|''Y''}}|/{{math|''k''}}估计通过给定集合的两个签名能够在{{math|O(''k'')}}能够计算出来，因此，当{{math|&epsilon;}} and {{math|''k''}}为常数时，从签名中计算相似度估计的时间也为常数，这样当众多两两相似度需要计算时，该方法在运行时间上与每个集合中元素的完全比较相比，能够有实质性的优化。

==最小哈希式独立排列==
为了实现上述的最小哈希方法，哈希函数{{math|''h''}}需要定义{{math|''n''}}元素上的一个随机排列，这里的{{math|''n''}}是指待比较的所有集合并集中不相交元素的总数。
但是由于存在{{math|''n''!}}个不同的排列，仅仅指定一个真正随机的排列就需要{{math|&Omega;(''n'' log ''n'')}}位，即使{{math|''n''}}一般时，这个数值也很大。基于这样的事实，与{{tsl|en|universal hashing|全域哈希}}相类似的理论，有大量的研究工作寻找“最小哈希式独立的”一簇排列，意指针对域的任意子集，任何元素都与其最小值是等可能的。已经证明，最小哈希式独立的排列簇至少必须包含:<math>lcm(1, 2, ..., n) \ge e^{n-o(n)}</math>个不同的排列，因此它需要{{math|&Omega;(''n'')}}位来指定一个排列，这个数值仍然很大。<ref name="bcfm"/>

由于实践上不可行，引入了最小哈希式独立的两个变型概念：严格最小哈希式独立排列簇和近似最小哈希式独立排列簇。
严格的最小哈希式独立是指最小哈希式独立属性被限制在集合基数至多为{{math|''k''}}的一些集合中。<ref>{{citation
 | last1 = Matou?ek | first1 = Ji?í | author1-link = Ji?í Matou?ek (mathematician)
 | last2 = Stojakovi? | first2 = Milo?
 | doi = 10.1002/rsa.10101
 | issue = 4
 | journal = Random Structures and Algorithms
 | pages = 397–408
 | title = On restricted min-wise independence of permutations
 | volume = 23
 | year = 2003}}.</ref>
近似最小哈希式独立最多有一个固定的概率{{math|&epsilon;}}变化为完全独立。<ref>{{citation
 | last1 = Saks | first1 = M. | author1-link = Michael Saks (mathematician)
 | last2 = Srinivasan | first2 = A.
 | last3 = Zhou | first3 = S.
 | last4 = Zuckerman | first4 = D.
 | doi = 10.1016/S0020-0190(99)00163-5
 | issue = 1–2
 | journal = {{tsl|en|Information Processing Letters||Information Processing Letters}}
 | pages = 29–32
 | title = Low discrepancy sets yield approximate min-wise independent permutation families
 | volume = 73
 | year = 2000}}.</ref>

==应用==
最小哈希的最初应用包括在Web文档中聚类并消除近似重复，这通过在那些文档中出现的词语集合来描述。<ref name="b97"/><ref name="bcfm"/> 相似的技术也应用于其他类型数据的聚类和近似重复消除，如图片：在图片数据中，一张图片可以通过分割用很多更小的子图片集合或更多复杂图片特征的描述集合来表示。<ref>{{citation
 | last1 = Chum | first1 = Ond?ej
 | last2 = Philbin | first2 = James
 | last3 = Isard | first3 = Michael
 | last4 = Zisserman | first4 = Andrew
 | contribution = Scalable near identical image and shot detection
 | doi = 10.1145/1282280.1282359
 | title = Proceedings of the 6th ACM International Conference on Image and Cideo Retrieval (CIVR'07)
 | year = 2007}}; {{citation
 | last1 = Chum
 | first1 = Ond?ej
 | last2 = Philbin
 | first2 = James
 | last3 = Zisserman
 | first3 = Andrew
 | contribution = Near duplicate image detection: min-hash and tf-idf weighting
 | page = 4
 | title = Proceedings of the British Machine Vision Conference
 | url = http://www.bmva.org/bmvc/2008/papers/119.pdf
 | volume = 3
 | year = 2008
 | accessdate = 2012-05-18
 | archive-date = 2021-02-24
 | archive-url = https://web.archive.org/web/20210224225910/http://www.bmva.org/bmvc/2008/papers/119.pdf
 | dead-url = no
 }}.</ref>
 
{{harvtxt|Schleimer|Wilkerson|Aiken|2003}}使用最小哈希技术作为数字文档剽窃检测方法的一部分，他们的方法将文档表示成给定长度的子串集合，将文档划分成更大固定长度的窗口，然后使用子串的最小哈希值作为每个窗口的描述值。如果文本的拷贝部分比两倍窗口尺寸还要长，则该描述值将肯定匹配保存在数据库中众多描述值中的一个，这样那个窗口就可以用来检查有多少内容是拷贝的。<ref>{{citation
 | last1 = Schleimer | first1 = Saul
 | last2 = Wilkerson | first2 = Daniel S.
 | last3 = Aiken | first3 = Alex
 | contribution = Winnowing: local algorithms for document fingerprinting
 | doi = 10.1145/872757.872770
 | pages = 76–85
 | title = Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03)
 | year = 2003}}.</ref>

在[[数据挖掘]]领域，{{harvtxt|Cohen|Datar|Fujiwara|Gionis|2001}}使用最小哈希技术作为[[关联规则学习]]的工具。给定一个数据库，其中每一项都有多个属性（可看作是每行为一个数据库项， 每列为一个属性的[[0-1矩阵]]），他们将最小哈希的近似度方法应用于Jaccard系数，用来辨别频繁共同出现的属性候选对，然后仅计算这些候选对的确切系数值，以确定哪些项目共同出现的频度低于一个给定的严格阈值。<ref>{{citation
 | last1 = Cohen | first1 = E.
 | last2 = Datar | first2 = M.
 | last3 = Fujiwara | first3 = S.
 | last4 = Gionis | first4 = A.
 | last5 = Indyk | first5 = P. | author5-link = Piotr Indyk
 | last6 = Motwani | first6 = R. | author6-link = Rajeev Motwani
 | last7 = Ullman | first7 = J. D. | author7-link = Jeffrey Ullman
 | last8 = Yang | first8 = C.
 | doi = 10.1109/69.908981
 | issue = 1
 | journal = IEEE Transactions on Knowledge and Data Engineering
 | pages = 64–78
 | title = Finding interesting associations without support pruning
 | volume = 13
 | year = 2001}}.</ref>

==相关主题==
最小哈希方法可看作是{{tsl|en|locality sensitive hashing|局部性敏感哈希}}的一个实例。局部性敏感哈希是使用哈希将大集合的数据对象映射到更小的哈希值的技术集合，通过这样的方法当两个对象距离相近时，它们的哈希值也可以相同。在最小哈希方法实例中，一个集合的签名可看作是它的哈希值。其它局部性敏感哈希技术还有针对集合间的[[汉明距离|海明距离]]，以及[[向量]]间的[[余弦相似性|余弦距离]]等。另外，局部性敏感哈希还在[[最邻近搜索|最近邻搜索]]算法有着重要的应用。<ref>{{citation
 | last1 = Andoni | first1 = Alexandr
 | last2 = Indyk | first2 = Piotr | author2-link = Piotr Indyk
 | doi = 10.1145/1327452.1327494
 | issue = 1
 | journal = [[ACM通讯]]
 | pages = 117–122
 | title = Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
 | volume = 51
 | year = 2008}}.</ref>


==参考文献==
{{reflist}}

[[Category:哈希函数]]
[[Category:聚类条件]]