Fast implementation of rolling hash in C++

Disclaimer: this page is a side-by-side Chinese/English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/711770/

Time: 2020-08-27 16:53:29  Source: igfitidea

Fast implementation of Rolling hash

Tags: c++, hash

Asked by

I need a Rolling hash to search for patterns in a file. (I am trying to use the Rabin-Karp string search algorithm).

I understand how a good hash works and how a good rolling hash should work, but I am unable to figure out how to efficiently implement the divide (or inverse multiplication) when rolling the hash. I also read that rsync uses a rolling version of adler32, but that doesn't look like a random enough hash.
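
For readers unfamiliar with the adler32-style approach mentioned above, here is a minimal sketch of a rolling checksum in that style. It is a simplified illustration, not rsync's actual code: the struct name `RollingSum`, its members, and the choice to start the sums at zero are all assumptions made for this example. The point it shows is that such a checksum can be rolled with additions and subtractions only, with no division or inverse multiplication.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Adler-32-style rolling checksum over a fixed-size window (illustrative only).
struct RollingSum {
    static const uint32_t MOD = 65521; // largest prime below 2^16, as used by Adler-32
    uint32_t a = 0;   // sum of the bytes in the window
    uint32_t b = 0;   // sum of the successive values of a (a position-weighted sum)
    uint32_t len = 0; // window length

    // Compute the checksum of s[pos .. pos+n) from scratch.
    void init(const std::string& s, std::size_t pos, std::size_t n) {
        assert(pos + n <= s.size());
        a = 0;
        b = 0;
        len = static_cast<uint32_t>(n);
        for (std::size_t i = 0; i < n; i++) {
            a = (a + static_cast<unsigned char>(s[pos + i])) % MOD;
            b = (b + a) % MOD;
        }
    }

    // Slide the window by one byte: drop `out`, append `in`. Constant time.
    void roll(unsigned char out, unsigned char in) {
        a = (a + MOD - out + in) % MOD;
        // The dropped byte carried weight len in b; every remaining byte gains 1.
        b = (b + MOD - (len % MOD) * out % MOD + a) % MOD;
    }

    uint32_t value() const { return (b << 16) | a; }
};
```

To use it, call `init` once on the first window and then `roll(s[i - n], s[i])` for each following position; the rolled value should equal what `init` would compute for the new window.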

Ideally it will be great if you can point me to an optimized C/C++ implementation, but any pointers in the right direction will help.

Accepted answer by v3.

Cipher's "prime base" idea should work decently - though the solution he posted looks a bit sketchy.

I don't think there's any need for inverse multiplication in this method. Here's my solution:

Say the string we currently have hashed is "abc", and we want to append "d" and remove "a".

Just like Cipher, my basic hash algorithm will be:

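// PRIME_BASE and PRIME_MOD are defined in the full implementation further down
unsigned hash(const string& s)
{
    unsigned long long ret = 0; //wide enough that ret * PRIME_BASE cannot overflow
    for (size_t i = 0; i < s.size(); i++)
    {
        ret *= PRIME_BASE; //shift over by one
        ret += (unsigned char)s[i]; //add the current char
        ret %= PRIME_MOD; //don't overflow
    }
    return (unsigned)ret;
}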

Now, to implement sliding:

hash1 = [0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]

We'd like to add something at the end and remove the first value, so

hash2 = [1]*base^(n-1) + [2]*base^(n-2) + ... + [n]

First we can add the last letter:

hash2 = (hash1 * PRIME_BASE) + newchar;
=> [0]*base^n + [1]*base^(n-1) + ... + [n-1]*base + [n]

Then simply subtract the first character:

hash2 -= firstchar * pow(base, n);
=> [1]*base^(n-1) + ... + [n]
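
The two steps above can be checked on the "abc" to "bcd" example. This is a small sketch under stated assumptions: the names `BASE`, `MOD`, `full_hash`, `base_pow` and `slide` are made up for the illustration, and the constants simply reuse the values of PRIME_BASE and PRIME_MOD. Everything is reduced modulo `MOD`, and `MOD` is added before subtracting so that the unsigned value never goes below zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

const uint64_t BASE = 257;        // same value as PRIME_BASE
const uint64_t MOD  = 1000000007; // same value as PRIME_MOD

// Polynomial hash of s, reduced modulo MOD at every step.
uint64_t full_hash(const std::string& s) {
    uint64_t h = 0;
    for (unsigned char c : s)
        h = (h * BASE + c) % MOD;
    return h;
}

// BASE^n mod MOD by repeated multiplication.
uint64_t base_pow(std::size_t n) {
    uint64_t p = 1;
    for (std::size_t i = 0; i < n; i++)
        p = p * BASE % MOD;
    return p;
}

// Given the hash h of an n-character window, append `in` and drop `out`.
uint64_t slide(uint64_t h, std::size_t n, unsigned char out, unsigned char in) {
    h = (h * BASE + in) % MOD;               // [0]*base^n + ... + [n]
    uint64_t drop = base_pow(n) * out % MOD; // the [0]*base^n term
    return (h + MOD - drop) % MOD;           // [1]*base^(n-1) + ... + [n]
}
```

With these definitions, `slide(full_hash("abc"), 3, 'a', 'd')` should give the same value as `full_hash("bcd")`. In a real search loop `base_pow(n)` would be computed once, outside the loop.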

An important note: you have to be careful about overflow. You can choose to just let it overflow unsigned int, but I think it's much more prone to collision (but also faster!)
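
As a rough sketch of the "just let it overflow" option: with an unsigned 64-bit type, wrap-around is well defined in C++ and behaves like an implicit modulus of 2^64, so no `%` is needed. The names `WRAP_BASE`, `wrap_hash`, `wrap_pow` and `wrap_slide` are hypothetical, and the 64-bit width is an assumption of this example (the note above talks about unsigned int).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

const uint64_t WRAP_BASE = 257; // hypothetical name; same value as PRIME_BASE

// Polynomial hash of s[pos .. pos+n); unsigned overflow wraps modulo 2^64.
uint64_t wrap_hash(const std::string& s, std::size_t pos, std::size_t n) {
    assert(pos + n <= s.size());
    uint64_t h = 0;
    for (std::size_t i = 0; i < n; i++)
        h = h * WRAP_BASE + static_cast<unsigned char>(s[pos + i]);
    return h;
}

// WRAP_BASE^n, also wrapping modulo 2^64.
uint64_t wrap_pow(std::size_t n) {
    uint64_t p = 1;
    for (std::size_t i = 0; i < n; i++)
        p *= WRAP_BASE;
    return p;
}

// Slide an n-character window by one position; `top` must be wrap_pow(n).
uint64_t wrap_slide(uint64_t h, uint64_t top, unsigned char out, unsigned char in) {
    return h * WRAP_BASE + in - top * out; // all arithmetic wraps, no % needed
}
```

The rolled value matches a fresh `wrap_hash` of the new window because addition, subtraction and multiplication all stay consistent under wrap-around.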

Here's my implementation:

#include <cstddef>
#include <iostream>
#include <string>
using namespace std;

const unsigned PRIME_BASE = 257;
const unsigned PRIME_MOD = 1000000007;

unsigned hash(const string& s)
{
    long long ret = 0;
    for (size_t i = 0; i < s.size(); i++)
    {
        ret = ret*PRIME_BASE + (unsigned char)s[i];
        ret %= PRIME_MOD; //don't overflow
    }
    return (unsigned)ret;
}

int rabin_karp(const string& needle, const string& haystack)
{
    const size_t n = needle.size();
    if (n == 0 || n > haystack.size())
        return -1; //nothing to search for, or the needle cannot fit

    //I'm using long longs to avoid overflow
    //::hash selects the function above; a plain "hash" can clash with std::hash
    long long hash1 = ::hash(needle);
    long long hash2 = 0;

    //power = PRIME_BASE^n mod PRIME_MOD
    //you could use exponentiation by squaring for extra speed
    long long power = 1;
    for (size_t i = 0; i < n; i++)
        power = (power * PRIME_BASE) % PRIME_MOD;

    for (size_t i = 0; i < haystack.size(); i++)
    {
        //add the last letter
        hash2 = hash2*PRIME_BASE + (unsigned char)haystack[i];
        hash2 %= PRIME_MOD;

        //remove the first character, if needed
        if (i >= n)
        {
            hash2 -= power * (unsigned char)haystack[i-n] % PRIME_MOD;
            if (hash2 < 0) //negative can be made positive with mod
                hash2 += PRIME_MOD;
        }

        //match? equal hashes can still be a collision, so compare the text too
        if (i >= n-1 && hash1 == hash2 && haystack.compare(i-(n-1), n, needle) == 0)
            return (int)(i - (n-1));
    }

    return -1;
}

int main()
{
    cout << rabin_karp("waldo", "willy werther warhol wendy --> waldo <--") << endl;
}
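
The comment in the code mentions exponentiation by squaring as a faster way to compute the power. A minimal sketch of that idea follows; the function name `mod_pow` and its signature are assumptions for this example, and it requires the modulus to fit in 32 bits so that intermediate products fit in 64 bits (true for PRIME_MOD).

```cpp
#include <cassert>
#include <cstdint>

// (base^exp) mod m using repeated squaring: O(log exp) multiplications.
// Requires 0 < m < 2^32 so that the products below cannot overflow 64 bits.
uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t m) {
    assert(m > 0 && m <= 0xFFFFFFFFull);
    uint64_t result = 1 % m; // handles m == 1
    base %= m;
    while (exp > 0) {
        if (exp & 1)                  // lowest bit set: multiply this power in
            result = result * base % m;
        base = base * base % m;       // square for the next bit
        exp >>= 1;
    }
    return result;
}
```

In the implementation above, `mod_pow(PRIME_BASE, needle.size(), PRIME_MOD)` would stand in for the loop that builds `power`. The gain only matters for long needles, since the loop runs once per search.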

Answer by obecalp

Some pointers for a fast implementation:

  1. Avoid the modulo n operation (% in C-like languages); use a mask of n - 1 instead, where n is 2^k. This includes the operations for the hash table lookup. Yes, it's possible to produce a good hash with a non-prime modulus.
  2. Pick multipliers and exponents with good figures of merit; see this paper for details.
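
To illustrate the first point, here is a small sketch of replacing `%` with a bit mask when the table size is a power of two. The function name `bucket_index` is made up for this example; it only shows the masking step and says nothing about which multipliers make the hash itself good.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Map a hash value to a bucket of a table whose size is a power of two.
// For such sizes, h & (table_size - 1) equals h % table_size.
std::size_t bucket_index(uint64_t h, std::size_t table_size) {
    // A power of two has exactly one bit set.
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return static_cast<std::size_t>(h & (table_size - 1));
}
```

For example, `bucket_index(h, 1024)` keeps the low 10 bits of `h`, which is the same value `h % 1024` would produce.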

Answer by Jake

I wrote this a while back. It's written in C#, but that is very close to C; you will only have to add a couple of parameters. This should work, but I haven't tested this version: I removed a couple of lines that would ignore case or non-word chars. I hope this helps.

private const int primeBase = 101;
//primeBase^2*[0]+primeBase^1*[1]+primeBase^0*[2]
//==
//primeBase*(primeBase*[0]+[1])+[2]
public static int primeRollingHash(String input, int start, int end)
{
    int acc = 0;
    for (int i = start; i <= end; i++)
    {
        char c = input[i];
        acc *= primeBase;
        acc += c;
    }
    return acc;
}

public static int primeRollingHash(String input)
{
    return primeRollingHash(input, 0, input.Length - 1);
}

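public static int rollHashRight(int currentHashValue, String input,
                                int start, int newEnd)
{
    // the current window is input[start..newEnd-1]; nothing to roll at the end
    if (newEnd == input.Length)
        return currentHashValue;
    // power of primeBase carried by the first character of the window
    int exponent = newEnd - start - 1;
    char newChar = input[newEnd];
    int firstValue = input[start];
    unchecked // wrap-around on overflow is intended, as in primeRollingHash
    {
        // firstValue = input[start] * primeBase^exponent
        for (int i = 0; i < exponent; i++)
            firstValue *= primeBase;
        // drop the first character, shift by one, add the new character
        return (currentHashValue - firstValue) * primeBase + newChar;
    }
}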