performance: Getting an int representation of a String

Question

Asked by jase

I am looking for a way to create an int/long representation of an arbitrary alphanumeric String. Hash codes won't do, because I can't afford hash collisions, i.e. the representation must be unique and repeatable.

The numeric representation will be used to perform efficient (hopefully) compares. The creation of the numeric key will take some time, but it only has to happen once, whereas I need to perform vast numbers of comparisons with it - which will hopefully be much faster than comparing the raw Strings.

Any other ideas on faster String comparison will be most appreciated too...

Answer 1

Answered by Don Kirkby

Unless your string is limited in length, you can't avoid collisions.

An int has 4294967296 (2^32) possible values. If your string has more than 4 ASCII characters, or more than two 16-bit Unicode characters, then there are more possible strings than possible int values, so you can't have a unique int for every possible 5-character string. A long has more possible values (2^64), but that only gives a unique value to every possible string of up to 8 ASCII characters.

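To make the 8-character limit concrete, here is a minimal sketch that packs a short ASCII string into a long, one byte per character. The class name `AsciiKey` and the method `pack` are made up for illustration, and the sketch assumes the strings contain no NUL characters (otherwise the zero padding would be ambiguous).

```java
final class AsciiKey {
    /** Packs an ASCII string of at most 8 characters into a long, one byte per character. */
    static long pack(String s) {
        if (s.length() > Long.BYTES) {
            throw new IllegalArgumentException("at most 8 characters fit in a long");
        }
        long key = 0L;
        for (int i = 0; i < Long.BYTES; i++) {
            long b = 0L; // shorter strings are padded on the right with zero bytes
            if (i < s.length()) {
                char c = s.charAt(i);
                if (c == 0 || c > 127) {
                    throw new IllegalArgumentException("only non-NUL ASCII is supported");
                }
                b = c;
            }
            key = (key << 8) | b; // first character ends up in the most significant byte
        }
        return key;
    }
}
```

Because every byte is below 128, the keys are non-negative, so comparing two keys with `Long.compare` gives the same order as comparing the original ASCII strings character by character.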

Hash codes are useful as a two step process: first see if the hash code matches, then check the whole string. For most strings that don't match, you only need to do the first step, and it's really fast.

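A rough sketch of that two-step check, assuming you wrap each string once in a small key object. `HashedKey` is a hypothetical name, not something from the JDK.

```java
final class HashedKey {
    private final String text;
    private final int hash; // computed once, reused for every comparison

    HashedKey(String text) {
        this.text = text;
        this.hash = text.hashCode();
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof HashedKey)) return false;
        HashedKey that = (HashedKey) other;
        // Step 1: differing hashes prove the strings differ.
        if (hash != that.hash) return false;
        // Step 2: equal hashes may still be a collision, so confirm with the full text.
        return text.equals(that.text);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```

Most mismatches are rejected by the single int comparison in step 1; the character-by-character check only runs when the hashes agree.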

Answer 2

Answered by Patrick McElhaney

Can't you just start with a hash code, and if the hash codes match, do a character by character comparison?

Answer 3

Answered by Konrad Rudolph

How long are the strings? If they are very short, a unique ID can be generated by treating the characters as digits in base 36 (26 + 10) that form an n-digit number, where n is the length of the string. On the other hand, if the strings are short enough to allow this, direct comparison won't be an issue anyway.

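The base-36 idea might look roughly like this. `Base36Key` and `encode` are hypothetical names; the sketch assumes case does not matter (`Long.parseLong` treats "a" and "A" as the same digit) and limits the input to 11 characters so the result always fits in a long. The extra leading "1" digit is my own addition so that strings with leading zeros or different lengths stay distinct.

```java
final class Base36Key {
    static final int MAX_LENGTH = 11; // "1" plus 11 base-36 digits always fits in a long

    /** Maps a case-insensitive alphanumeric string of at most 11 characters to a unique long. */
    static long encode(String s) {
        if (s.length() > MAX_LENGTH) {
            throw new IllegalArgumentException("at most " + MAX_LENGTH + " characters supported");
        }
        // Without the sentinel digit, "0a" and "a" would both parse to 10.
        // parseLong throws NumberFormatException for anything that is not a base-36 digit.
        return Long.parseLong("1" + s, 36);
    }
}
```

For example, `encode("a")` is 1 * 36 + 10 = 46, while `encode("0a")` is 1 * 36 * 36 + 0 * 36 + 10 = 1306.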

Otherwise you'll have to generate a collision-free hash and this can only be done when the complete problem space is known in advance (i.e. if you know all strings that can possibly occur). You will want to have a look at perfect hashing, although the only feasible algorithm to find a perfect hash function that I know is probabilistic so collisions are still theoretically possible.

There might be other ways to find such a function. Knuth called this a "rather amusing … puzzle" in TAoCP but he doesn't give an algorithm either.

In general, you give far too little information to find an algorithm that doesn't require probing the whole problem space in some manner. This invariably means that the problem has exponential running time, though it could be approached using machine-learning heuristics. I'm not sure if this is advisable in your case.

Answer 4

Answered by toolkit

Perhaps:

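import java.math.BigInteger;
// Reads the string as one base-36 number (digits 0-9, letters a-z, case-insensitive).
// Caveat: leading zeros are dropped, so "0a" and "a" yield the same value.
String y = "oiu291981u39u192u3198u389u28u389u";
BigInteger bi = new BigInteger(y, 36);
System.out.println(bi);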

Answer 5

Answered by Adam Davis

At the end of the day, a single alphanumeric character has at least 36 possible values. If you include punctuation, lower case, etc then you can easily pass 72 possible values.

A non-colliding number that allows you to quickly compare strings would necessarily grow exponentially with the length of the string.

So you first must decide on the longest string you are expecting to compare. Assuming it's N characters in length, and assuming you ONLY need uppercase letters and the numerals 0-9, then you need an integer representation that can be as high as 36^N.

For a string of length 25 (a common name-field size), you end up needing a binary number with 130 bits.

If you split that into 32 bit numbers, you'll need 5 (four would hold only 128 bits). Then you can compare each number (five integer compares should take no time, compared to walking the string). I would recommend a big number library, but for this specialized case I'm pretty sure you can write your own and get better performance.

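The bit counts in this answer can be checked with a few lines of BigInteger arithmetic. This is only a sketch; `KeyWidth`, `bitsNeeded` and `wordsNeeded` are names made up for illustration.

```java
import java.math.BigInteger;

final class KeyWidth {
    /** Bits needed to give every string of exactly `length` symbols a distinct value. */
    static int bitsNeeded(int alphabetSize, int length) {
        // The largest key is alphabetSize^length - 1; bitLength() is its width in bits.
        return BigInteger.valueOf(alphabetSize)
                .pow(length)
                .subtract(BigInteger.ONE)
                .bitLength();
    }

    /** Number of fixed-size words needed to hold the given number of bits (rounded up). */
    static int wordsNeeded(int bits, int wordBits) {
        return (bits + wordBits - 1) / wordBits;
    }
}
```

With 36 symbols and 25 characters, `bitsNeeded(36, 25)` gives 130, and `wordsNeeded(130, 32)` gives 5. With 72 symbols and 10 characters, `bitsNeeded(72, 10)` gives 62, which fits in a single 64 bit long.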

If you want to handle 72 possible values per character (uppercase, lowercase, numerals, punctuation...) and you need 10 characters, then you'll need 62 bits: two 32 bit integers (or one 64 bit integer if you're on a system that supports 64 bit computing).

If, however, you are not able to restrict the characters in the string (i.e., each could be any of the 256 byte values) and you can't bound the length of the string, then comparing the strings directly is the only way to go, but there's a shortcut.

Cast the pointer of the string to a 32 bit unsigned integer array, and compare the string 4 bytes at a time (or 64 bits/8bytes at a time on a 64 bit processor). This means that a 100 character string only requires 25 compares maximum to find which is greater.

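Java has no pointer casts, but the same word-at-a-time idea can be sketched over byte arrays with `ByteBuffer`. This assumes the strings have already been encoded to single-byte characters; `WordCompare` is a hypothetical name. Note that reading the words big-endian and comparing them as unsigned values is what keeps the result consistent with byte-by-byte order; a raw pointer cast on a little-endian CPU would not give that ordering without a byte swap.

```java
import java.nio.ByteBuffer;

final class WordCompare {
    /** Lexicographic comparison of unsigned bytes, 8 bytes at a time where possible. */
    static int compare(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a); // ByteBuffer reads big-endian by default
        ByteBuffer y = ByteBuffer.wrap(b);
        int common = Math.min(a.length, b.length);
        int i = 0;
        // Compare whole 64-bit words while both arrays still have 8 bytes left.
        for (; i + Long.BYTES <= common; i += Long.BYTES) {
            long p = x.getLong(i);
            long q = y.getLong(i);
            if (p != q) {
                return Long.compareUnsigned(p, q);
            }
        }
        // Compare the remaining tail one byte at a time.
        for (; i < common; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        // All shared bytes are equal, so the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    }
}
```

Whether this beats `String.equals` or `String.compareTo` in practice is something to measure rather than assume, since the JVM already optimizes those heavily.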

You may need to re-define the character set (and convert the strings) so that the characters with higher precedence are assigned values closer to 0, and lower precedence values closer to 255 (or vice versa, depending on how you are comparing them).

Good luck!

-Adam

Answer 6

Answered by ckpwong

As long as it's a hash function, be it String.hashCode(), MD5 or SHA1, collision is unavoidable unless you have a fixed limit on the string's length. It is mathematically impossible to have a one-to-one mapping from an infinite set to a finite set.

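For String.hashCode() you don't even need long strings to see this. A well-known two-character example, shown here as a small demo (the class name `CollisionDemo` is just for illustration):

```java
final class CollisionDemo {
    /** True when two different strings share the same String.hashCode() value. */
    static boolean collides(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // 'A' * 31 + 'a' = 65 * 31 + 97 = 2112 and 'B' * 31 + 'B' = 66 * 31 + 66 = 2112
        System.out.println("Aa".hashCode());      // 2112
        System.out.println("BB".hashCode());      // 2112
        System.out.println(collides("Aa", "BB")); // true
    }
}
```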

Stepping back, is collision avoidance absolutely necessary?

Answer 7

Answered by Grzegorz Gierlik

A few questions to begin with:

Did you test that simple string comparison is too slow?
What does the comparison look like (is 'ABC' == 'abc' or 'ABC' != 'abc')?
How many strings do you have to compare?
How many comparisons do you have to do?
What do your strings look like (length, letter case)?

As far as I remember, a String in Java is an object, and identical string literals (or strings passed through intern()) refer to the same object. Strings built at runtime are separate objects even when their contents match.

So, maybe it would be enough to intern the strings once and then compare references (String.equals already checks reference equality before comparing characters).

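A small sketch of how reference comparison behaves with and without `String.intern()`. `InternDemo` and `sameAfterIntern` are illustrative names only.

```java
final class InternDemo {
    /** Compares the canonical (interned) copies of two strings by reference. */
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("abc"); // new String(...) always creates a distinct object
        String b = new String("abc");
        System.out.println(a == b);                // false: different objects
        System.out.println(a.equals(b));           // true: same characters
        System.out.println(sameAfterIntern(a, b)); // true: both map to one pooled object
    }
}
```

The interning cost is paid once per string; after that, each comparison is a single reference check. Keep in mind that the intern pool is shared across the whole JVM, so interning huge numbers of distinct strings has a memory cost.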

If that doesn't help, you can try a Pascal-style string implementation, where the first element is the length. If your strings vary in length, this should save some CPU time, because strings of different lengths can be rejected without looking at any characters.

Answer 8

Answered by Chris Upchurch

How long are your strings? Unless you choose an int representation that's at least as long as the string, collisions will always be possible no matter what conversion you're using. So if you're using a 32 bit integer, you can only uniquely represent strings of up to 4 bytes.

Answer 9

Answered by Pramod

How big are your strings? Arbitrarily long strings cannot be compressed into 32/64 bit format.

Answer 10

Answered by Thomas Owens

If you don't want collisions, try something insane like SHA-512. I can't guarantee there won't be collisions, but I don't think they have found any yet.

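For reference, computing a SHA-512 digest in Java could look like the sketch below. `Sha512Key` is a made-up name, and the code assumes the runtime ships a SHA-512 provider (standard JDKs do). Note that the result is 64 bytes, so it is far wider than an int or long and would have to be compared as a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha512Key {
    /** Returns the 64-byte SHA-512 digest of the string's UTF-8 encoding. */
    static byte[] digest(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            return md.digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Assumption: only reached on a runtime without a SHA-512 provider.
            throw new IllegalStateException("SHA-512 is not available in this runtime", e);
        }
    }
}
```

Two digests can then be compared with `java.util.Arrays.equals`. Collisions remain possible in principle, as the other answers point out; they are just extremely unlikely.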
