C#: Calculate a checksum for a string
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/9837732/
Calculate a checksum for a string
Asked by jgauffin
I have a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
- The same checksum must be returned each time a calculation is done for a string
- The checksum must be unique (no collisions)
- I cannot store previous IDs to check for collisions
Which algorithm should I use?
Update:
- Is there an approach which is reasonably unique, i.e. where the likelihood of a collision is very small?
- The checksum should be alphanumeric
- The strings are unicode
- The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
- The length of the checksum is not important for me (the shorter, the better)
Update 2:
Let's say that I have the following string: "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext in Linux, i.e. the user just writes (in a Razor view):
@T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identify that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient, and I'm therefore looking for a way to generate a key out of it.
Accepted answer by Guffa
That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense; either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
The MD5 algorithm, for example, produces a 16-byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
using System;
using System.Security.Cryptography;
using System.Text;

string theString = "asdf";
string hash;
using (MD5 md5 = MD5.Create()) {
    // UTF-8 preserves every Unicode character; the 16 hash bytes are then hex-encoded.
    hash = BitConverter.ToString(
        md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
    ).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570
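As a rough sketch, the steps in this answer can be wrapped in a small helper so that the same text always maps to the same key. The name `ChecksumOf` is hypothetical and not part of the original answer; the sample text is the one from the question.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper (not from the original answer):
// UTF-8 encode the text, MD5-hash the bytes, then hex-encode the 16-byte result.
static string ChecksumOf(string text)
{
    using (MD5 md5 = MD5.Create())
    {
        byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(text));
        return BitConverter.ToString(digest).Replace("-", string.Empty);
    }
}

string original = "Welcome to this website. Navigate using the flashy but useless menu above";
string key = ChecksumOf(original);

Console.WriteLine(key.Length);                   // 32 hex characters for a 16-byte hash
Console.WriteLine(key == ChecksumOf(original));  // True: same input, same checksum
Console.WriteLine(ChecksumOf("asdf"));           // 912EC803B2CE49E4A541068D495AB570
```

The key is always 32 characters from 0-9 and A-F, so it is alphanumeric regardless of how long the text is. Collisions remain possible in principle, as the answer points out.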
Answer by L.B
You can use cryptographic hash functions for this. Most of them are available in .NET.
For example:
using (var sha1 = System.Security.Cryptography.SHA1.Create())
{
    byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
    byte[] hash = sha1.ComputeHash(buf, 0, buf.Length);
    // Shorter alternative (Base64 can contain '+', '/' and '=', so it is not strictly alphanumeric):
    // var hashstr = Convert.ToBase64String(hash);
    var hashstr = System.BitConverter.ToString(hash).Replace("-", "");
}
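To make the trade-off between the two encodings in this answer concrete, here is a minimal sketch that compares their lengths for a 20-byte SHA-1 hash. The variable names are only illustrative.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

byte[] digest;
using (SHA1 sha1 = SHA1.Create())
{
    digest = sha1.ComputeHash(Encoding.UTF8.GetBytes("test"));
}

// Hex uses two characters per byte; Base64 uses four characters per three bytes (padded).
string hex = BitConverter.ToString(digest).Replace("-", "");
string base64 = Convert.ToBase64String(digest);

Console.WriteLine(digest.Length);  // 20
Console.WriteLine(hex.Length);     // 40
Console.WriteLine(base64.Length);  // 28
```

Base64 is shorter, but it ends in '=' padding here and may contain '+' or '/', so hex is the safer choice when the checksum has to be strictly alphanumeric.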
Answer by David Heffernan
Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (e.g. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.
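The counting argument can be illustrated with a deliberately tiny checksum. The sketch below assumes a toy 8-bit checksum (`TinyChecksum`, a hypothetical name, not a real algorithm anyone should use): with only 256 possible values, any 257 distinct strings must contain at least one collision.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Toy 8-bit checksum (an assumption for illustration): sum of the UTF-8 bytes modulo 256.
static byte TinyChecksum(string text)
{
    int sum = 0;
    foreach (byte b in Encoding.UTF8.GetBytes(text)) sum += b;
    return (byte)(sum % 256);
}

// Feeds distinct strings "s0", "s1", ... until two of them share a checksum.
static (string First, string Second) FindCollision()
{
    var seen = new Dictionary<byte, string>();
    for (int i = 0; i <= 256; i++)  // 257 distinct strings, only 256 possible checksum values
    {
        string candidate = "s" + i;
        byte sum = TinyChecksum(candidate);
        if (seen.TryGetValue(sum, out var earlier)) return (earlier, candidate);
        seen[sum] = candidate;
    }
    throw new InvalidOperationException("Unreachable: 257 strings cannot fit into 256 slots.");
}

var (first, second) = FindCollision();
Console.WriteLine($"\"{first}\" and \"{second}\" share checksum {TinyChecksum(first)}");
```

A larger checksum only pushes the collision further out; it cannot remove it, because the number of possible strings always exceeds the number of possible checksum values.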
Answer by cocogorilla
I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to look up against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id).
When reading from the table, we wrote a proc that took the lookup text, computed its checksum, and then selected the rows where both the checksum and the text were equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
Good luck!
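An application-level version of this bucket idea might look like the sketch below. Everything here is an assumption for illustration: the in-memory dictionary stands in for the indexed table, `Checksum32`, `Add` and `Lookup` are hypothetical names, and the French text is sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Bucket index: checksum -> all (original text, translation) pairs that share it.
var buckets = new Dictionary<int, List<(string Text, string Translation)>>();

// Hypothetical stable 32-bit checksum: the first four bytes of the MD5 hash.
// (string.GetHashCode is avoided because its value can differ between processes.)
static int Checksum32(string text)
{
    using (MD5 md5 = MD5.Create())
    {
        return BitConverter.ToInt32(md5.ComputeHash(Encoding.UTF8.GetBytes(text)), 0);
    }
}

void Add(string text, string translation)
{
    int key = Checksum32(text);
    if (!buckets.TryGetValue(key, out var bucket))
    {
        bucket = new List<(string Text, string Translation)>();
        buckets[key] = bucket;
    }
    bucket.Add((text, translation));
}

// Narrow by checksum first, then confirm with a full text comparison inside the bucket.
string Lookup(string text)
{
    if (buckets.TryGetValue(Checksum32(text), out var bucket))
    {
        foreach (var entry in bucket)
        {
            if (entry.Text == text) return entry.Translation;
        }
    }
    return null;
}

Add("Welcome to this website.", "Bienvenue sur ce site web.");  // sample data
Console.WriteLine(Lookup("Welcome to this website."));
Console.WriteLine(Lookup("Unknown text") ?? "(no translation)");
```

Because the full text is still compared inside each bucket, a checksum collision only costs an extra comparison and never returns the wrong translation.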
Answer by Mitchell E.
To guarantee uniqueness for strings of almost unlimited size, treat the variable-length string as a set of concatenated substrings, each "x characters in length". Your hash function then only needs to determine uniqueness for a maximum substring length and generate a series of checksum numbers. Think of it as the equivalent of a network IP address made up of a set of checksum numbers.
Your issue with collisions is the assumption that a collision forces a slower search method to resolve each collision. If there is an insignificant number of possible collisions compared to the number of hashed objects, then as a whole the extra overhead becomes nil. A collision is due to a table being sized smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes" and each object within the table may have a reference count of the objects at that collision. Only if this count is greater than 1 is there a collision, or multiple instances of the same substring.

