Java 和 Python 程序的相同一致性哈希算法实现

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/12362417/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-10-31 08:34:32  来源:igfitidea点击:

Same consistent-hashing algorithm implementation for Java and Python program

javapythonconsistent-hashing

提问by superche

We have an app that the Python module will write data to redis shards and the Java module will read data from redis shards, so I need to implement the exact same consistent hashing algorithm for Java and Python to make sure the data can be found.

I googled around and tried several implementations, but found the Java and Python implementations are always different, can't be used togather. Need your help.

我们有一个应用程序,Python 模块将数据写入 redis 分片,Java 模块将从 redis 分片读取数据,因此我需要为 Java 和 Python 实现完全相同的一致哈希算法,以确保可以找到数据。

google了一番,尝试了几个实现,发现Java和Python的实现总是不一样,不能混用。需要你的帮助。

Edit, online implementations I have tried:
Java: http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
Python: http://techspot.zzzeek.org/2012/07/07/the-absolutely-simplest-consistent-hashing-example/
http://amix.dk/blog/post/19367

编辑,我尝试过的在线实现:
Java:http
: //weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.htmlPython:http: //techspot.zzzeek.org/2012/07/07 /the-absolutely-simplest-consistent-hashing-example/
http://amix.dk/blog/post/19367

Edit, attached Java (Google Guava lib used) and Python code I wrote. Code are based on the above articles.

编辑,附加 Java(使用 Google Guava lib)和我编写的 Python 代码。代码基于以上文章。

import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;
import com.google.common.hash.HashFunction;

public class ConsistentHash<T> {
    private final HashFunction hashFunction;
    private final int numberOfReplicas;
    private final SortedMap<Long, T> circle = new TreeMap<Long, T>();

    public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
            Collection<T> nodes) {
        this.hashFunction = hashFunction;
        this.numberOfReplicas = numberOfReplicas;

        for (T node : nodes) {
            add(node);
        }
    }

    public void add(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.put(hashFunction.hashString(node.toString() + i).asLong(),
                    node);
        }
    }

    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hashFunction.hashString(node.toString() + i).asLong());
        }
    }

    public T get(Object key) {
        if (circle.isEmpty()) {
            return null;
        }
        long hash = hashFunction.hashString(key.toString()).asLong();
        if (!circle.containsKey(hash)) {
            SortedMap<Long, T> tailMap = circle.tailMap(hash);
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
    }
}

Test code:

测试代码:

        ArrayList<String> al = new ArrayList<String>(); 
        al.add("redis1");
        al.add("redis2");
        al.add("redis3");
        al.add("redis4");

        String[] userIds = 
        {"-84942321036308",
        "-76029520310209",
        "-68343931116147",
        "-54921760962352"
        };
        HashFunction hf = Hashing.md5();

        ConsistentHash<String> consistentHash = new ConsistentHash<String>(hf, 100, al); 
        for (String userId : userIds) {
            System.out.println(consistentHash.get(userId));
        }

Python code:

蟒蛇代码:

import bisect
import md5

class ConsistentHashRing(object):
    """Implement a consistent hashing ring."""

    def __init__(self, replicas=100):
        """Create a new ConsistentHashRing.

        :param replicas: number of replicas.

        """
        self.replicas = replicas
        self._keys = []
        self._nodes = {}

    def _hash(self, key):
        """Given a string key, return a hash value."""

        return long(md5.md5(key).hexdigest(), 16)

    def _repl_iterator(self, nodename):
        """Given a node name, return an iterable of replica hashes."""

        return (self._hash("%s%s" % (nodename, i))
                for i in xrange(self.replicas))

    def __setitem__(self, nodename, node):
        """Add a node, given its name.

        The given nodename is hashed
        among the number of replicas.

        """
        for hash_ in self._repl_iterator(nodename):
            if hash_ in self._nodes:
                raise ValueError("Node name %r is "
                            "already present" % nodename)
            self._nodes[hash_] = node
            bisect.insort(self._keys, hash_)

    def __delitem__(self, nodename):
        """Remove a node, given its name."""

        for hash_ in self._repl_iterator(nodename):
            # will raise KeyError for nonexistent node name
            del self._nodes[hash_]
            index = bisect.bisect_left(self._keys, hash_)
            del self._keys[index]

    def __getitem__(self, key):
        """Return a node, given a key.

        The node replica with a hash value nearest
        but not less than that of the given
        name is returned.   If the hash of the
        given name is greater than the greatest
        hash, returns the lowest hashed node.

        """
        hash_ = self._hash(key)
        start = bisect.bisect(self._keys, hash_)
        if start == len(self._keys):
            start = 0
        return self._nodes[self._keys[start]]

Test code:

测试代码:

import ConsistentHashRing

if __name__ == '__main__':
    server_infos = ["redis1", "redis2", "redis3", "redis4"];
    hash_ring = ConsistentHashRing()
    test_keys = ["-84942321036308",
        "-76029520310209",
        "-68343931116147",
        "-54921760962352",
        "-53401599829545"
        ];

    for server in server_infos:
        hash_ring[server] = server

    for key in test_keys:
        print str(hash_ring[key])

采纳答案by lvc

You seem to be running into two issues simultaneously: encoding issues and representation issues.

您似乎同时遇到了两个问题:编码问题和表示问题。

Encoding issues come about particularly since you appear to be using Python 2 - Python 2's strtype is not at all like Java's Stringtype, and is actually more like a Java array of byte. But Java's String.getBytes()isn't guaranteed to give you a byte array with the same contents as a Python str(they probablyuse compatible encodings, but aren't guaranteed to - even if this fix doesn't change things, it's a good idea in general to avoid problems in the future).

编码问题尤其出现,因为您似乎在使用 Python 2 - Python 2 的str类型根本不像 Java 的String类型,实际上更像是byte. 但是 JavaString.getBytes()不能保证为您提供与 Python 具有相同内容的字节数组str(它们可能使用兼容的编码,但不能保证 - 即使此修复程序不会改变事情,一般来说也是个好主意)避免将来出现问题)。

So, the way around this is to use a Python type that behaves like Java's String, and convert the corresponding objects from both languages to bytes specifying the same encoding. From the Python side, this means you want to use the unicodetype, which is the default string literal type if you are using Python 3, or put this near the top of your .py file:

因此,解决此问题的方法是使用行为类似于 Java 的 Python 类型,String并将两种语言的相应对象转换为指定相同编码的字节。从 Python 方面来看,这意味着您要使用unicode类型,如果您使用的是 Python 3,这是默认的字符串文字类型,或者将其放在 .py 文件的顶部附近:

from __future__ import unicode_literals

If neither of those is an option, specify your string literals this way:

如果这些都不是一个选项,请以这种方式指定您的字符串文字:

u'text'

The uat the front forces it to unicode. This can then be converted to bytes using its encodemethod, which takes (unsurprisingly) an encoding:

u它前面的力量为Unicode。然后可以使用其encode方法将其转换为字节,该方法采用(不出所料)编码:

u'text'.encode('utf-8')

From the Java side, there is an overloaded version of String.getBytesthat takes an encoding - but it takes it as a java.nio.Charsetrather than a string - so, you'll want to do:

从 Java 方面来看,有一个重载版本,String.getBytes它需要一个编码——但它把它作为一个java.nio.Charset而不是一个字符串——所以,你会想要做:

"text".getBytes(java.nio.charset.Charset.forName("UTF-8"))

These will give you equivalent sequences of bytes in both languages, so that the hashes have the same input and will give you the same answer.

这些将为您提供两种语言中等效的字节序列,以便散列具有相同的输入并为您提供相同的答案。

The other issue you may have is representation, depending on which hash function you use. Python's hashlib(which is the preferred implementation of md5 and other cryptographic hashes since Python 2.5) is exactly compatible with Java's MessageDigestin this - they both give bytes, so their output should be equivalent.

您可能遇到的另一个问题是表示,具体取决于您使用的哈希函数。Python hashlib(这是自 Python 2.5 以来 md5 和其他加密哈希的首选实现)MessageDigest在这方面与 Java 完全兼容- 它们都提供字节,因此它们的输出应该是等效的。

Python's zlib.crc32and Java's java.util.zip.CRC32, on the other hand, both give numeric results - but Java's is always an unsigned 64 bit number, while Python's (in Python 2) is a signed 32 bit number (in Python 3, its now an unsigned 32-bit number, so this problem goes away). To convert a signed result to an unsigned one, do: result & 0xffffffff, and the result should be comparable to the Java one.

另一方面,Pythonzlib.crc32和 Javajava.util.zip.CRC32都给出数字结果——但 Java 总是一个无符号的 64 位数字,而 Python(在 Python 2 中)是一个有符号的 32 位数字(在 Python 3 中,它现在是一个无符号的 32 位数字,所以这个问题就消失了)。要将有符号的结果转换为无符号的结果,请执行:result & 0xffffffff,结果应该与 Java 的结果相当。

回答by Brendan Long

According to this analysis of hash functions:

根据对哈希函数的分析

Murmur2, Meiyan, SBox, and CRC32 provide good performance for all kinds of keys. They can be recommended as general-purpose hashing functions on x86.

Hardware-accelerated CRC (labeled iSCSI CRC in the table) is the fastest hash function on the recent Core i5/i7 processors. However, the CRC32 instruction is not supported by AMD and earlier Intel processors.

Murmur2、Meiyan、SBox 和 CRC32 为各种键提供了良好的性能。它们可以被推荐为 x86 上的通用哈希函数。

硬件加速 CRC(在表中标记为 iSCSI CRC)是最近的 Core i5/i7 处理器上最快的哈希函数。但是,AMD 和更早的 Intel 处理器不支持 CRC32 指令。

Python has zlib.crc32and Java has a CRC32 class. Since it's a standard algorithm, you should get the same result in both languages.

Python 有zlib.crc32而 Java 有一个CRC32 类。由于它是标准算法,因此您应该在两种语言中获得相同的结果。

MurmurHash 3 is available in Google Guava(a very useful Java library) and in pyfasthashfor Python.

MurmurHash 3在 Google Guava(一个非常有用的 Java 库)和Python 的pyfasthash 中可用。

Note that these aren't cryptographic hash functions, so they're fast but don't provide the same guarantees. If these hashes are important for security, use a cryptographic hash.

请注意,这些不是加密哈希函数,因此它们速度很快但不提供相同的保证。如果这些散列对安全很重要,请使用加密散列。

回答by Chris Bode

I'm not familiar with Redis, but the Python example appears to be hashing keys, so I'm assuming we're talking about some sort of HashMap implementation.

我不熟悉 Redis,但 Python 示例似乎是散列键,所以我假设我们正在谈论某种 HashMap 实现。

Your python example appears to be using MD5 hashes, which will be the same in both Java and Python.

您的 python 示例似乎使用 MD5 哈希,这在 Java 和 Python 中都是相同的。

Here is an example of MD5 hashing in Java:

以下是 Java 中 MD5 散列的示例:

http://www.dzone.com/snippets/get-md5-hash-few-lines-java

http://www.dzone.com/snippets/get-md5-hash-few-lines-java

And in Python:

在 Python 中:

http://docs.python.org/library/md5.html

http://docs.python.org/library/md5.html

Now, you may want to find a faster hashing algorithm. MD5 is focused on cryptographic security, which isn't really needed in this case.

现在,您可能想找到一种更快的散列算法。MD5 专注于加密安全,这在这种情况下并不是真正需要的。

回答by basiljames

Differnt language implementations of a hashing algorithm does not make the hash value different. The SHA-1hash whether generated in java or python will be the same.

散列算法的不同语言实现不会使散列值不同。SHA-1无论是在 java 还是 python 中生成的哈希都是一样的。

回答by Chris Bode

Here is a simple hashing function that produces the same result on both python and java for your keys:

这是一个简单的散列函数,它在 python 和 java 上为您的密钥生成相同的结果:

Python

Python

def hash(key):
        h = 0
        for c in key:
                h = ((h*37) + ord(c)) & 0xFFFFFFFF
        return h;

Java

爪哇

public static int hash(String key) {
    int h = 0;
    for (char c : key.toCharArray())
        h = (h * 37 + c) & 0xFFFFFFFF;
    return h;
}

You don't need a cryptographically secure hash for this. That's just overkill.

为此,您不需要加密安全的哈希。那简直是矫枉过正。

回答by Dr. Jan-Philip Gehrcke

Let's get this straight: the same binaryinput to the same hash function (SHA-1, MD5, ...) in different environments/implementations (Python, Java, ...) will yield the same binaryoutput. That's because these hash functions are implemented according to standards.

让我们这个直:同样的二进制输入到不同的环境相同的散列函数(SHA-1,MD5,...)/实现(Python和Java中,...)会产生相同的二进制输出。那是因为这些哈希函数是按照标准实现的。

Hence, you will discover the sources of the problem(s) you experience when answering these questions:

因此,您将发现在回答这些问题时遇到的问题的根源:

  • do you provide the same binary input to both hash functions (e.g. MD5 in Python and Java)?

  • do you interpret the binary output of both hash functions (e.g. MD5 in Python and Java) equivalently?

  • 您是否为两个哈希函数(例如 Python 和 Java 中的 MD5)提供相同的二进制输入?

  • 您是否等效地解释了两个哈希函数(例如 Python 和 Java 中的 MD5)的二进制输出?

@lvc's answer provides much more detail on these questions.

@lvc 的回答提供了有关这些问题的更多详细信息。

回答by Q i

For the java version, I would recommend using MD5 which generates 128bit string result and it can then be converted into BigInteger (Integer and Long are not enough to hold 128bit data).

对于 java 版本,我建议使用 MD5,它生成 128 位字符串结果,然后可以将其转换为 BigInteger(Integer 和 Long 不足以容纳 128 位数据)。

Sample code here:

示例代码在这里:

private static class HashFunc {

    static MessageDigest md5;

    static {
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            //
        }
    }

    public synchronized int hash(String s) {
        md5.update(StandardCharsets.UTF_8.encode(s));
        return new BigInteger(1, md5.digest()).intValue();
    }
}

Note that:

注意:

The java.math.BigInteger.intValue() converts this BigInteger to an int. This conversion is analogous to a narrowing primitive conversion from long to int. If this BigInteger is too big to fit in an int, only the low-order 32 bits are returned. This conversion can lose information about the overall magnitude of the BigInteger value as well as return a result with the opposite sign.

java.math.BigInteger.intValue() 将此 BigInteger 转换为 int。这种转换类似于从 long 到 int 的缩小原始转换。如果此BigInteger 太大而无法放入 int,则仅返回低 32 位。此转换可能会丢失有关 BigInteger 值的整体大小的信息,并会返回具有相反符号的结果。