查看“︁旋转哈希”︁的源代码

{{TA
|G1=IT
}}
'''旋转哈希'''（也称为'''滚动哈希'''、'''递归哈希'''、'''滚动校验和'''或'''滑动哈希'''）是一种[[散列函数|哈希函数]]，输入的内容在一个窗口中进行移动哈希。

少数哈希函数允许快速计算滚动哈希值 — 只给出旧的哈希值，新的哈希值被快速计算出来，旧的值从窗口中移除，新的值添加到窗口中 — 类似于[[移動平均]]函数的计算方式，比其他低通滤波器更快。

主要应用之一是[[Rabin–Karp算法|Rabin–Karp字符串搜索算法]]，该算法使用下面描述的滚动哈希。另一个流行的应用是[[rsync]]程序，它使用基于Mark Adler的[[adler-32]]的校验和作为滚动哈希。低带宽网络文件系统（LBFS）使用[[Rabin指纹]]作为其滚动哈希。

滚动哈希值最多只能是{{tsl|en|pairwise independent|成对独立}}的<ref name="lemirekaser">Daniel Lemire, Owen Kaser: Recursive ''n''-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698–710, 2010. [[arXiv:0705.4676]].</ref>或{{tsl|en|universal hashing|强通用}}的。例如，它们不能是{{tsl|en|k-independent hashing|三向独立}}的。

== 多项式滚动哈希 ==
通常使用仅包含乘法和加法的滚动哈希函数来解释[[Rabin–Karp算法|Rabin–Karp字符串搜索算法]]：

:<math>H = c_1 a^{k-1} + c_2 a^{k-2} + c_3 a^{k-3} + ... + c_k a^{0},</math>

其中<math>a</math>是一个常数，并且<math>c_1, ..., c_k</math>是输入字符（但此函数不是[[Rabin指纹]]，见下文）。

为了避免操纵巨大的<math>H</math>值，所有数学运算都要取模<math>n</math>。<math>a</math>和<math>n</math>的选择对获得良好哈希至关重要；更多的讨论请参见[[線性同餘方法|线性同余生成器]]。

移除和增加字符只需将第一项或最后一项加减即可。将所有字符向左移动一个位置，需要将整个和<math>H</math>乘以<math>a</math>。将所有字符向右移一个位置，需要将整个和<math>H</math>除以<math>a</math>。注意，在取模运算中，<math>a</math>可以选择更换为模倒数<math>a^{-1}</math>，据此，可以在不实际执行除法的情况下，通过将<math>H</math>与之相乘的方法得到除法的结果。

== Rabin指纹 ==

[[Rabin指纹]]是另一种哈希，它也将输入解释为多项式，但是在[[伽罗瓦域]]{{tsl|en|GF(2)|GF(2)}}（即[[除以2]]的[[同余]]）上。它不把输入视为字节的多项式，而是将其视为位的多项式，并且所有算术均在GF(2)中完成（类似于[[循環冗餘校驗|CRC32]]）。哈希是该多项式除以GF(2)的不可约多项式的结果。可以只使用进入和离开的字节来更新Rabin指纹，使其成为有效的滚动哈希。

因为它与Rabin-Karp字符串搜索算法有着相同的作者，而Rabin-Karp字符串搜索算法经常被用另一种更简单的滚动哈希来解释，而且这种更简单的滚动哈希也是一种多项式，所以这两种滚动哈希通常彼此混淆。 备份软件[https://restic.net/ restic] {{Wayback|url=https://restic.net/ |date=20210309112829 }}使用Rabin指纹来分割文件，其Blob大小在512字节和8MiB之间变化。
<ref>{{Cite web|url=https://restic.readthedocs.io/en/stable/100_references.html#backups-and-deduplication|title=References — restic 0.9.0 documentation|website=restic.readthedocs.io|language=en|access-date=2018-05-24|archive-date=2020-12-25|archive-url=https://web.archive.org/web/20201225191731/https://restic.readthedocs.io/en/stable/100_references.html#backups-and-deduplication|dead-url=no}}</ref>

== 循环多项式 ==

通过循环多项式<ref>Jonathan D. Cohen, Recursive Hashing Functions for ''n''-Grams, ACM Trans. Inf. Syst. 15 (3), 1997.</ref> — 有时也被称为Buzhash — 进行哈希，也很简单，但它的好处是避免了乘法，而是使用[[桶式移位器]]。这是{{tsl|en|tabulation hashing|列表哈希}}的一种形式：它假定存在一些从字符到整数区间<math>[0,2^L)</math>的哈希函数<math>h</math>。该哈希函数可以是一个简单的数组，也可以是一个将字符映射到随机整数的[[哈希表]]。让函数<math>s</math>是一个循环二进制旋转（或循环移位 ）：它将位向左旋转1，将最新位推到第一个位置。 例如，<math>s(10011)=00111</math>。<math>\oplus</math>是按位[[逻辑异或|异或]]。哈希值定义为
:<math> H = s^{k-1}(h( c_1 )) \oplus s^{k-2}( h(c_2) )  \oplus \ldots \oplus  s( h(c_{k-1}) ) \oplus   h(c_k),</math>
其中2的幂的乘法可以通过二进制移位来实现。结果是一个在区间<math>[0,2^L)</math>中的数。

以滚动方式计算哈希值的方法如下。 让<math>H</math>是先前的哈希值。 旋转<math>H</math>一次：<math>H \leftarrow s(H)</math>。如果<math>c_1</math>是要删除的字符，旋转它<math>k</math>次：<math>s^{k}(h( c_1 ))</math>。然后简单地设置

:<math>H \leftarrow s(H) \oplus s^{k}(h( c_1 )) \oplus h(c_{k+1}),</math>

这里<math>c_{k+1}</math>是增加的新字符。

由循环多项式进行哈希处理具有很强的普遍性或对偶性：只需保持第一个<math>L-k+1</math>位。也就是说，取结果<math>H</math>并且不需要考虑任何<math>k-1</math>连续的位。<ref name="lemirekaser" />在实际操作中，这可以通过整数除法<math>H \rightarrow H \div 2^{k-1}</math>来实现。

备份软件{{tsl|en|Attic_(backup_software)|Attic（备份软件）|Attic}}使用可自定义分块大小范围的Buzhash算法来切分文件流。<ref>{{Cite web|url=https://borgbackup.readthedocs.io/en/stable/internals/data-structures.html#chunker-details|title=Data structures and file formats — Borg - Deduplicating Archiver 1.1.5 documentation|website=borgbackup.readthedocs.io|language=en|access-date=2018-05-24|archive-date=2021-03-11|archive-url=https://web.archive.org/web/20210311183102/https://borgbackup.readthedocs.io/en/stable/internals/data-structures.html#chunker-details|dead-url=no}}</ref>

== 使用滚动哈希进行基于内容的分片 ==
滚动哈希函数的一个有趣的用例是，它可以创建动态的、基于内容的流或文件的分块。当需要在网络上只发送一个大文件的变化块时，这一点特别有用，因为在文件的最前面增加一个简单的字节会导致所有固定大小的窗口成为更新状态，而实际上只有第一个“块”被修改。

计算动态分块的最简单方法是计算滚动哈希，如果它符合一个模式（例如低N位全为零），那么它就是一个分块边界。这种方法可以确保文件的任何变化都只会影响其当前和可能的下一个分块，而不会影响其他的。

当知道边界后，需要通过对其哈希值进行比较，来检测哪个分块被修改，需要在网络上传输。
<ref>{{cite web|url=http://blog.teamleadnet.com/2012/10/rabin-karp-rolling-hash-dynamic-sized.html|title=Rabin Karp rolling hash - dynamic sized chunks based on hashed content|last=Horvath|first=Adam|date=October 24, 2012|language=en|accessdate=2020-06-11|archive-date=2013-02-24|archive-url=https://web.archive.org/web/20130224074325/http://blog.teamleadnet.com/2012/10/rabin-karp-rolling-hash-dynamic-sized.html|dead-url=no}}</ref>

== 使用移动和的基于内容的切片 ==
一些程序，包括gzip（带有<code>--rsyncable</code>选项）和rsyncrypto，会根据这个特定的（未加权的）移动和进行基于内容的分片：<ref>{{Cite web |url=http://rsyncrypto.lingnu.com/index.php/Algorithm |website=rsyncrypto.lingnu.com |title=Rsyncrypto Algorithm |language=en |access-date=2020-06-11 |archive-url=https://web.archive.org/web/20160815185121/http://rsyncrypto.lingnu.com/index.php/Algorithm |archive-date=2016-08-15 |dead-url=yes }}</ref>
:<math>S(n) =  \sum_{i = n - 8195}^{n} c_i,</math>

:<math>H(n) = S(n) \mod 4096,</math>

其中

* <math>S(n)</math>是8196个连续字节之和，以字节<math>n</math>结尾（需要21位存储空间）；
* <math>c_i</math>是文件的第<math>i</math>个字节；
* <math>H(n)</math>是一个“哈希值”，由<math>S(n)</math>底部12位组成。

将窗口移动一个字节，只需将新字符添加到总和中，并从总和中减去最旧的字符（不再在窗口中）。

对于每个使<math>H(n) == 0</math>的<math>n</math>，这些程序会在<math>n</math>和<math>n+1</math>之间切开文件。这种方法将确保文件中的任何变化将只影响其当前和可能的下一个分块，而不会影响其他分块。

== Gear指纹以及快速基于内容分块算法FastCDC ==
基于内容的切片算法需要逐字节地对于字符串进行哈希计算，并在匹配到特定的模式时进行分片处理。逐字节的比较会带来巨大的额外计算开销，而 FastCDC  <ref name="Xia Zhou Jiang Feng">{{cite web | last=Xia | first=Wen | last2=Zhou | first2=Yukun | last3=Jiang | first3=Hong | last4=Feng | first4=Dan | last5=Hua | first5=Yu | last6=Hu | first6=Yuchong | last7=Liu | first7=Qing | last8=Zhang | first8=Yucheng | title=FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication | website=USENIX | url=https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia | access-date=2020-07-24 | archive-date=2020-08-22 | archive-url=https://web.archive.org/web/20200822155142/https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia | dead-url=no }}</ref>
算法则提出了一种快速计算基于内容的分块算法。这种算法采用了基于滚动哈希的 <math> Gear </math> 指纹<ref name="XiaJiang2014">{{cite journal|last1=Xia|first1=Wen|last2=Jiang|first2=Hong|last3=Feng|first3=Dan|last4=Tian|first4=Lei|last5=Fu|first5=Min|last6=Zhou|first6=Yukun|title=Ddelta: A deduplication-inspired fast delta compression approach|journal=Performance Evaluation|volume=79|year=2014|pages=258–272|issn=01665316|doi=10.1016/j.peva.2014.07.016}}</ref> 算法，跳过最小分段长度，同时可以进行长度归一化，同时滑动两个字节等操作。这样可以大大加快分块算法的处理速度，处理速度可以达到原先的 3-12 倍<ref name="IEEE Journals & Magazine 2020">{{cite web | title=The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems | website=IEEE Journals & Magazine | date=2020-06-16 | url=https://ieeexplore.ieee.org/abstract/document/9055082/ | access-date=2020-07-22 | archive-date=2020-07-24 | archive-url=https://web.archive.org/web/20200724085229/https://ieeexplore.ieee.org/abstract/document/9055082/ | dead-url=no }}</ref>。
基础版本的算法伪代码如下：

 '''algorithm''' FastCDC
     '''input:''' data buffer ''src'' , 
            data length ''n'', 
     '''output:''' cut point ''i''
     
     ''MinSize'' &larr; 2KB     //split minimum chunk size is 2KB
     ''MaxSize'' &larr; 64KB    //split maximum chunk size is 64KB
     ''fp'' &larr; ''0''
     ''i'' &larr; ''MinSize''
     ''Mask'' &larr; ''0x0000d93003530000LL''
     
     // buffer size is less than minimum chunk size
     '''if''' ''n'' ≤ ''MinSize'' '''then'''
         return n;
     '''if''' ''n'' ≥ ''MaxSize'' '''then'''
         ''n'' &larr; ''MaxSize''
      
     '''while''' ''i'' < ''n'' '''do'''
         ''fp'' &larr; (''fp'' << 1 ) + ''Gear''[''src''[''i'']]
         '''if''' !(''fp'' & ''Mask'') '''then'''
             '''return''' ''i''
    
     '''return''' ''i''

其中

* <math>Gear</math> 指纹是提前计算的一个哈希散列数组。它采用了 <math>Gear </math> 哈希算法，其保证了哈希结果分布均匀性的同时可以快速地计算哈希值。与传统的 <math>Rabin</math> 算法相比，它的计算速度更快。经过实验比较 <ref name="IEEE Journals & Magazine 2020">{{cite web | title=The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems | website=IEEE Journals & Magazine | date=2020-06-16 | url=https://ieeexplore.ieee.org/abstract/document/9055082/ | access-date=2020-07-22}}</ref>，在进行数据分段的时候，同 <math>Rabin</math> 算法相比，他们产生的数据块大小分布几乎一致而 <math> Gear</math> 算法需要的时间更短 <ref name="XiaJiang2014">{{cite journal|last1=Xia|first1=Wen|last2=Jiang|first2=Hong|last3=Feng|first3=Dan|last4=Tian|first4=Lei|last5=Fu|first5=Min|last6=Zhou|first6=Yukun|title=Ddelta: A deduplication-inspired fast delta compression approach|journal=Performance Evaluation|volume=79|year=2014|pages=258–272|issn=01665316|doi=10.1016/j.peva.2014.07.016}}</ref>。

算法从数组下标为 <math> MinSize </math> 开始循环，省去了处理长度不足最小分块所浪费的时间。计算当前字节下对应的指纹信息，当发现指纹信息等于提前预设好的 <math>Mask</math> 时，则进行分段处理。如果计算到最大的 <math> MaxSize </math> 时仍然没有匹配到 <math> Mask </math> 时，此时强行进行分块。


== 计算的复杂度 ==

所有滚动哈希函数在字符数上都是线性的，但其复杂度与窗口长度的关系（<math>k</math>）不等。Rabin–Karp滚动哈希需要两个<math>k</math>位数字的乘法，[[乘法|整数乘法]]是在<math>O(k \log k 2^{O(\log^*k)})</math>。<ref>M. Fürer, Faster integer multiplication, in: STOC ’07, 2007, pp. 57–66.</ref>通过循环多项式对[[N元语法|ngram]]进行哈希处理，可以在线性时间内完成。<ref name="lemirekaser"/>

== 软件 ==

* [https://github.com/lemire/rollinghashcpp rollinghashcpp] {{Wayback|url=https://github.com/lemire/rollinghashcpp |date=20201126204555 }}是几个滚动哈希函数的[[自由软件|免费软件]]C++实现
* [https://github.com/lemire/rollinghashjava rollinghashjava] {{Wayback|url=https://github.com/lemire/rollinghashjava |date=20201226053451 }}是滚动哈希函数的Apache许可Java实现
* [https://github.com/dpc/rdedup rdedup] {{Wayback|url=https://github.com/dpc/rdedup |date=20201124065446 }} 是 FastCDC 函数的 Rust 实现

== 参见 ==

* [[最小哈希]]

* {{tsl|en|w-shingling|w-重迭}}

== 外部链接 ==

* [http://courses.csail.mit.edu/6.006/spring11/rec/rec06.pdf MIT 6.006: Introduction to Algorithms 2011- Lecture Notes - Rolling Hash] {{Wayback|url=http://courses.csail.mit.edu/6.006/spring11/rec/rec06.pdf |date=20130816165110 }}

== 引用 ==

<references/>

[[Category:散列函数]]