Java: Generate unique id from unique string input
Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/2194206/
Asked by pkrish
I have a table with a column of unique string values. The maximum length of a string value is 255 characters. I want to generate a unique id with the string value as input. In other words, I am looking for a compact representation of a string. The generated unique id can be alphanumeric. A useful feature would be the ability to regenerate the string value from the unique id.
Is there an efficient function to generate such a unique id? Some possible approaches are checksums or hash functions. I want to know if there is a standard way to do this.
I am using a MySQL database and Java.
Thanks!
--edit: I am looking for a more compact representation rather than just using the string itself.
Answered by Dagon
How unique is "unique"? Using any good hashing function (MD5 is decent for most uses, and easily implemented via java.security.MessageDigest.getInstance("MD5")) can get you to a 128-bit number that's very, very likely to be unique. Using a subset of the hash gets you a smaller ID, with a higher chance of collision.
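As a rough sketch of the MD5 approach described above (the class and method names `Md5Id`, `md5Hex`, and `shortId` are made up for illustration, and UTF-8 is an assumed encoding), the digest can be rendered as a 32-character hex ID and optionally truncated:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of any library.
final class Md5Id {
    // Returns the 128-bit MD5 digest of the input as 32 lowercase hex characters.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            // %032x left-pads with zeros so the ID always has 32 characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Keeps only the first hexChars characters: shorter ID, higher collision risk.
    static String shortId(String input, int hexChars) {
        return md5Hex(input).substring(0, hexChars);
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("hello"));     // 5d41402abc4b2a76b9719d911017c592
        System.out.println(shortId("hello", 8)); // 5d41402a
    }
}
```

Note that this ID is one-way: the original string cannot be recovered from it, and two different strings can in principle produce the same value.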
Using an auto_increment field in the DB, if it fits your design, might be easier to implement, will truly guarantee uniqueness, and will use smaller IDs than the 16 bytes of MD5. You can also then meet your requirement of finding the string by the key, which you can't do for a hash.
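To illustrate why a counter-based ID is both unique and reversible, here is a small in-memory sketch. It only imitates what an auto_increment column would do inside the database; the class `StringRegistry` and its methods are hypothetical names, not an existing API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for a table with an auto_increment key (illustration only).
final class StringRegistry {
    private final Map<String, Integer> idByValue = new HashMap<>();
    private final List<String> valueById = new ArrayList<>();

    // Returns the existing ID for a known string, or assigns the next counter value.
    int idFor(String value) {
        Integer existing = idByValue.get(value);
        if (existing != null) {
            return existing;
        }
        valueById.add(value);
        int id = valueById.size(); // IDs start at 1, as auto_increment does by default
        idByValue.put(value, id);
        return id;
    }

    // Reverse lookup: recovers the original string from its ID.
    String valueFor(int id) {
        return valueById.get(id - 1);
    }
}
```

The ID carries no information about the string itself; the mapping lives entirely in the lookup structure, which is exactly what the table provides.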
Answered by Bill K
This is related to compression. The simplest way would be to bit-pack and get each character down to the bare minimum number of bits.
A-Z is 26 characters, which fits within 32 values (5 bits).
Add a-z and it's 6 bits (with 12 bit-patterns left over to represent other characters).
Let's say that is enough for you. So you have 6x255 bits, which is 1530 bits to store your string (192 bytes, rounding up).
Going with only caps (5 bits per character) would reduce that a little, to 1275 bits (160 bytes).
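The 6-bit packing described above could look roughly like this. The 64-symbol alphabet (letters, digits, space, underscore) is an assumption made for the example, as are the names `SixBitPacker`, `pack`, and `unpack`; the original length has to be stored separately so the padding bits at the end can be ignored.

```java
// Hypothetical 6-bits-per-character packer (illustration only).
final class SixBitPacker {
    // Assumed alphabet: 26 + 26 + 10 + 2 = 64 symbols, so each fits in 6 bits.
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 _";

    // Packs each character into 6 bits, most significant bit first.
    static byte[] pack(String s) {
        byte[] out = new byte[(s.length() * 6 + 7) / 8]; // round up to whole bytes
        int bitPos = 0;
        for (int i = 0; i < s.length(); i++) {
            int code = ALPHABET.indexOf(s.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("unsupported character: " + s.charAt(i));
            }
            for (int b = 5; b >= 0; b--, bitPos++) {
                if (((code >> b) & 1) != 0) {
                    out[bitPos / 8] |= (byte) (0x80 >> (bitPos % 8));
                }
            }
        }
        return out;
    }

    // Reverses pack(); length is the number of characters originally packed.
    static String unpack(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        int bitPos = 0;
        for (int i = 0; i < length; i++) {
            int code = 0;
            for (int b = 0; b < 6; b++, bitPos++) {
                int bit = (packed[bitPos / 8] >> (7 - bitPos % 8)) & 1;
                code = (code << 1) | bit;
            }
            sb.append(ALPHABET.charAt(code));
        }
        return sb.toString();
    }
}
```

A 255-character string packs into 192 bytes this way, a saving of about 25% over one byte per character, which is about as far as you get without knowing more about the data.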
You can optimize it more, but then you have to go into a compression algorithm that expects a specific language or patterns in the Strings and optimizes those patterns.
Unless you can further specify the contents of the strings, you're just not going to get what you want. Sorry. (If you can tell more about the contents of the strings, do so. One of us may see patterns that will allow much better "Compression")
This lack of ability to do what you want is why hashtables are so cool. They get a "Mostly Unique" number and then have a second level of resolution to test cases where two strings hashed to the same number.
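A small demonstration of that second level of resolution, using Java's own HashMap: the strings "Aa" and "BB" happen to have the same String.hashCode(), yet the map keeps them apart by also comparing keys with equals(). The class name `CollisionDemo` is just for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: two distinct keys with the same hash code in one map.
final class CollisionDemo {
    static Map<String, Integer> build() {
        Map<String, Integer> ids = new HashMap<>();
        ids.put("Aa", 1);
        ids.put("BB", 2); // same hashCode as "Aa", but a different key
        return ids;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        Map<String, Integer> ids = build();
        System.out.println(ids.get("Aa") + " " + ids.get("BB")); // 1 2
    }
}
```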
Answered by FrustratedWithFormsDesigner
If your database requires that the column contain unique values, then why not use the string itself? Anything else is just another step to encode/decode it.
Answered by Notinlist
A string of up to 255 characters has far more possible values than a 64-bit (or whatever size) number can represent, so a unique, reversible mapping for arbitrary strings is impossible. Add an auto_increment field.
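To put numbers on that counting argument, here is a quick check with BigInteger. The 26-letter alphabet is an assumption for the example, and `Pigeonhole`, `stringCount`, and `fitsIn` are made-up names.

```java
import java.math.BigInteger;

// Illustration of the counting (pigeonhole) argument.
final class Pigeonhole {
    // Number of distinct strings of exactly `length` characters over `alphabetSize` symbols.
    static BigInteger stringCount(int alphabetSize, int length) {
        return BigInteger.valueOf(alphabetSize).pow(length);
    }

    // True if `count` distinct values can each get their own `bits`-bit number.
    static boolean fitsIn(BigInteger count, int bits) {
        return count.compareTo(BigInteger.ONE.shiftLeft(bits)) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(fitsIn(stringCount(26, 13), 64));  // true
        System.out.println(fitsIn(stringCount(26, 14), 64));  // false
        System.out.println(fitsIn(stringCount(26, 255), 64)); // false
    }
}
```

Even with only uppercase letters, anything longer than 13 characters already has more possible values than a 64-bit number.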
Answered by philfreo
Since you're using MySQL, take a look at CRC32.
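If the checksum has to be computed on the Java side, java.util.zip.CRC32 offers a comparable function. This is only a sketch: the wrapper name `Crc32Id` and the UTF-8 encoding are assumptions, and whether the result matches MySQL's CRC32() for the same bytes has not been verified here.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical wrapper around the JDK's CRC-32 implementation.
final class Crc32Id {
    // Returns the CRC-32 checksum as an unsigned 32-bit value held in a long.
    static long crc32(String input) {
        CRC32 crc = new CRC32();
        crc.update(input.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        // "123456789" is the standard CRC-32 check input.
        System.out.println(Long.toHexString(crc32("123456789"))); // cbf43926
    }
}
```

With only 32 bits, collisions become likely well before the table gets large, so this works as a compact lookup key rather than a guaranteed-unique ID.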
Answered by Michael Sander
Choosing the proper key shouldn't be taken lightly.
You need to consider:
Replication: Do keys need to be shared between different servers? If so, you most probably need some sort of unique hash or GUID.
Size of the table/number of inserts: Consider that most RDBMSs physically store the data on the hard drive in the order of their (clustered) primary key. Now imagine what happens if you insert a hash value starting with 'a' into a reasonably large table. Yes, there's index padding, but eventually it fills up, and a single-row insert can cause a couple of GB to be moved on the hard drive.
Need replication AND have big tables? Use both. Use a clustered auto-increment (long) integer primary key and define a unique index on your hash column.
Answered by Henning
If you have a limited number of strings that occur frequently, creating a reference table with a numeric (auto-increment) ID, and a FK to that reference table in your main table could be an option.
If not, and you need to retrieve the original, you could run your strings through GZIP or any other compression algorithm.
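A minimal GZIP round trip with the JDK's built-in streams might look like the following. `GzipCodec` and its method names are invented for the example, and UTF-8 is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for compressing and restoring a string with GZIP.
final class GzipCodec {
    static byte[] compress(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the GZIP trailer has been written.
        return buf.toByteArray();
    }

    static String decompress(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One caveat: the GZIP format adds a fixed header and trailer of at least 18 bytes, so for strings of 255 characters or fewer the output is often no smaller than the input, and can be larger. It is worth measuring on real data before committing to this.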
If you don't need to retrieve the original, a hash function such as MD5 is what you're looking for.
Answered by Sean
public String getUniqueId(String uniqueString) {
    return uniqueString;
}
That is, unless the ID has constraints on it other than "be unique".

