C#字符串的GetHashCode()是如何实现的?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/15174477/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How is GetHashCode() of C# string implemented?
提问by Louis Rhys
I'm just curious because I guess it will have impact on performance. Does it consider the full string? If yes, it will be slow on long string. If it only consider part of the string, it will have bad performance (e.g. if it only consider the beginning of the string, it will have bad performance if a HashSet contains mostly strings with the same.
我只是好奇,因为我猜它会对性能产生影响。它会考虑完整的字符串吗?如果是,则在长字符串上会很慢。如果只考虑字符串的一部分,则性能会很差(例如,如果只考虑字符串的开头,如果 HashSet 包含大部分相同的字符串,则性能会很差。
采纳答案by Hans Passant
Be sure to obtain the Reference Source source codewhen you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target, the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs
当您有这样的问题时,请务必获取参考源代码。它比您从反编译器中看到的要多得多。选择一个与您首选的 .NET 目标相匹配的方法,版本之间的方法已经发生了很大变化。我将在这里重现它的 .NET 4.5 版本,从 Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs 检索
public override int GetHashCode() {
#if FEATURE_RANDOMIZED_STRING_HASHING
if(HashHelpers.s_UseRandomizedStringHashing)
{
return InternalMarvin32HashString(this, this.Length, 0);
}
#endif // FEATURE_RANDOMIZED_STRING_HASHING
unsafe {
fixed (char *src = this) {
Contract.Assert(src[this.Length] == '// string
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public unsafe override int GetHashCode()
{
IntPtr arg_0F_0;
IntPtr expr_06 = arg_0F_0 = this;
if (expr_06 != 0)
{
arg_0F_0 = (IntPtr)((int)expr_06 + RuntimeHelpers.OffsetToStringData);
}
char* ptr = arg_0F_0;
int num = 352654597;
int num2 = num;
int* ptr2 = (int*)ptr;
for (int i = this.Length; i > 0; i -= 4)
{
num = ((num << 5) + num + (num >> 27) ^ *ptr2);
if (i <= 2)
{
break;
}
num2 = ((num2 << 5) + num2 + (num2 >> 27) ^ ptr2[(IntPtr)4 / 4]);
ptr2 += (IntPtr)8 / 4;
}
return num + num2 * 1566083941;
}
', "src[this.Length] == '\0'");
Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");
#if WIN32
int hash1 = (5381<<16) + 5381;
#else
int hash1 = 5381;
#endif
int hash2 = hash1;
#if WIN32
// 32 bit machines.
int* pint = (int *)src;
int len = this.Length;
while (len > 2)
{
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
pint += 2;
len -= 4;
}
if (len > 0)
{
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
}
#else
int c;
char *s = src;
while ((c = s[0]) != 0) {
hash1 = ((hash1 << 5) + hash1) ^ c;
c = s[1];
if (c == 0)
break;
hash2 = ((hash2 << 5) + hash2) ^ c;
s += 2;
}
#endif
#if DEBUG
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;
#endif
return hash1 + (hash2 * 1566083941);
}
}
}
This is possibly more than you bargained for, I'll annotate the code a bit:
这可能比你预想的要多,我会稍微注释一下代码:
- The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
- The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entryin the app.exe.config file
- The fixedstatement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
- The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
- The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
- The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at allthe characters in the string, answering your question
- The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
- The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
- The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.
- #if 条件编译指令使此代码适应不同的 .NET 目标。FEATURE_XX 标识符在别处定义并关闭整个 .NET 源代码中的功能。当目标是32位版本的框架时定义WIN32,64位版本的mscorlib.dll是单独构建的,存放在GAC的不同子目录中。
- s_UseRandomizedStringHashing 变量启用了散列算法的安全版本,旨在让程序员避免使用 GetHashCode() 为密码或加密等内容生成散列等不明智的操作时遇到麻烦。它由app.exe.config 文件中的条目启用
- 该固定语句保持索引串便宜,避免了常规检查索引进行边界
- 第一个 Assert 确保字符串以零结尾,这是允许在循环中进行优化所必需的
- 第二个 Assert 确保字符串与 4 的倍数地址对齐,这是保持循环性能所必需的
- 循环是手动展开的,对于 32 位版本,每个循环消耗 4 个字符。转换为 int* 是一种在 int(32 位)中存储 2 个字符(2 x 16 位)的技巧。循环后的额外语句处理长度不是 4 的倍数的字符串。注意零终止符可能包含也可能不包含在哈希中,如果长度为偶数则不会。它查看字符串中的所有字符,回答您的问题
- 64 位版本的循环以不同的方式完成,手动展开 2。请注意,它在嵌入的零处提前终止,因此不会查看所有字符。否则非常罕见。这很奇怪,我只能猜测这与可能非常大的字符串有关。但是想不出实际的例子
- 最后的调试代码确保框架中的任何代码都不会依赖于在运行之间可重现的哈希代码。
- 哈希算法非常标准。值 1566083941 是一个幻数,是梅森绕线器中常见的质数。