VBA 哈希字符串

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/14717526/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-08 14:53:44  来源:igfitidea点击:

VBA hash string

vbahashexcel-2003

提问by nixda

How do I get a short hash of a long string using Excel VBA

如何使用 Excel VBA 获取长字符串的短哈希值

Whats given

给出了什么

  • Input string is not longer than 80 characters
  • Valid input characters are: [0..9] [A_Z] . _ /
  • Valid output characters are [0..9] [A_Z] [a_z] (lower and upper case can be used)
  • The output hash shouldn't be longer than ~12 characters (shorter is even better)
  • No need to be unique at all since this will result in a too long hash
  • 输入字符串不超过 80 个字符
  • 有效的输入字符是: [0..9] [A_Z] 。_ /
  • 有效的输出字符为 [0..9] [A_Z] [a_z] (可以使用小写和大写)
  • 输出哈希不应超过 ~12 个字符(越短越好)
  • 根本不需要是唯一的,因为这会导致哈希过长

What I have done so far

到目前为止我所做的

I thought this SO answeris a good start since it generates a 4-digit Hex-Code (CRC16).

我认为这个 SO 答案是一个好的开始,因为它生成了一个 4 位十六进制代码(CRC16)。

But 4 digits were to little. In my test with 400 strings 20% got a duplicate somewhere else.
The chance to generate a collision is too high.

但是4位数太少了。在我对 400 个字符串的测试中,20% 在其他地方得到了重复。
产生碰撞的机会太高。

Sub tester()
    For i = 2 To 433
        Cells(i, 2) = CRC16(Cells(i, 1))
    Next i
End Sub


Function CRC16(txt As String)
Dim x As Long
Dim mask, i, j, nC, Crc As Integer
Dim c As String

Crc = &HFFFF

For nC = 1 To Len(txt)
    j = Val("&H" + Mid(txt, nC, 2))
    Crc = Crc Xor j
    For j = 1 To 8
        mask = 0
        If Crc / 2 <> Int(Crc / 2) Then mask = &HA001
        Crc = Int(Crc / 2) And &H7FFF: Crc = Crc Xor mask
    Next j
Next nC

CRC16 = Hex$(Crc)
End Function

How to reproduce

如何繁殖

You can copy these 400 test strings from pastebin.
Paste them to column A in a new Excel workbook and execute the code above.

您可以从 pastebin复制这 400 个测试字符串
将它们粘贴到新 Excel 工作簿中的 A 列并执行上面的代码。

Q:How do I get a string hash which is short enough (12 chars) and long enough to get a small percentage of duplicates.

问:如何获得足够短(12 个字符)和足够长的字符串哈希以获取少量重复项。

采纳答案by Floris

Split your string into three shorter strings (if not divisible by three, the last one will be longer than the other two). Run your "short" algorithm on each, and concatenate the results.

将您的字符串拆分为三个较短的字符串(如果不能被 3 整除,则最后一个将比其他两个长)。在每个上运行“短”算法,并连接结果。

I could write the code but based on the quality of the question I think you can take it from here!

我可以编写代码,但根据问题的质量,我认为您可以从这里获取!

EDIT: It turns out that that advice is not enough. There is a serious flaw in your original CRC16 code - namely the line that says:

编辑:事实证明,这个建议是不够​​的。您的原始 CRC16 代码中存在严重缺陷 - 即以下行:

j = Val("&H" + Mid(txt, nC, 2))

This only handles text that can be interpreted as hex values: lowercase and uppercase letters are the same, and anything after F in the alphabet is ignored (as far as I can tell). That anything good comes out at all is a miracle. If you replace the line with

这仅处理可以解释为十六进制值的文本:小写和大写字母相同,字母表中 F 之后的任何内容都将被忽略(据我所知)。任何好的东西出现都是一个奇迹。如果你用

j = asc(mid(txt, nC, 1))

Things work better - every ASCII code at least starts out life as its own value.

事情变得更好了 - 每个 ASCII 代码至少以它自己的价值开始。

Combining this change with the proposal I made earlier, you get the following code:

将此更改与我之前提出的建议相结合,您将获得以下代码:

Function hash12(s As String)
' create a 12 character hash from string s

Dim l As Integer, l3 As Integer
Dim s1 As String, s2 As String, s3 As String

l = Len(s)
l3 = Int(l / 3)
s1 = Mid(s, 1, l3)      ' first part
s2 = Mid(s, l3 + 1, l3) ' middle part
s3 = Mid(s, 2 * l3 + 1) ' the rest of the string...

hash12 = hash4(s1) + hash4(s2) + hash4(s3)

End Function

Function hash4(txt)
' copied from the example
Dim x As Long
Dim mask, i, j, nC, crc As Integer
Dim c As String

crc = &HFFFF

For nC = 1 To Len(txt)
    j = Asc(Mid(txt, nC)) ' <<<<<<< new line of code - makes all the difference
    ' instead of j = Val("&H" + Mid(txt, nC, 2))
    crc = crc Xor j
    For j = 1 To 8
        mask = 0
        If crc / 2 <> Int(crc / 2) Then mask = &HA001
        crc = Int(crc / 2) And &H7FFF: crc = crc Xor mask
    Next j
Next nC

c = Hex$(crc)

' <<<<< new section: make sure returned string is always 4 characters long >>>>>
' pad to always have length 4:
While Len(c) < 4
  c = "0" & c
Wend

hash4 = c

End Function

You can place this code in your spreadsheet as =hash12("A2")etc. For fun, you can also use the "new, improved" hash4 algorithm, and see how they compare. I created a pivot table to count collisions - there were none for the hash12algorithm, and only 3 for the hash4. I'm sure you can figure out how to create hash8, ... from this. The "no need to be unique" from your question suggests that maybe the "improved" hash4is all you need.

您可以将此代码放在电子表格中=hash12("A2")。为了好玩,您还可以使用“新的、改进的”hash4 算法,并查看它们的比较情况。我创建了一个数据透视表来计算碰撞 -hash12算法没有,而hash4. 我相信你可以弄清楚如何创建hash8, ... 从这里开始。您的问题中的“无需独特”表明您可能只需要“改进”即可hash4

In principle, a four character hex should have 64k unique values - so the chance of two random strings having the same hash would be 1 in 64k. When you have 400 strings, there are 400 x 399 / 2 "possible collision pairs" ~ 80k opportunities (assuming you had highly random strings). Observing three collisions in the sample dataset is therefore not an unreasonable score. As your number of strings N goes up, the probability of collisions goes as the square of N. With the extra 32 bits of information in the hash12, you expect to see collisions when N > 20 M or so (handwaving, in-my-head-math).

原则上,一个四字符的十六进制应该有 64k 个唯一值——所以两个随机字符串具有相同散列的几率是 64k 中的 1 个。当您有 400 个字符串时,有 400 x 399 / 2 个“可能的碰撞对”~ 80k 机会(假设您有高度随机的字符串)。因此,在样本数据集中观察三个碰撞并不是一个不合理的分数。随着字符串数量 N 的增加,冲突的概率是 N 的平方。 有了 hash12 中额外的 32 位信息,您希望在 N > 20 M 左右时看到冲突(handwaving,in-my-头数学)。

You can make the hash12 code a little bit more compact, obviously - and it should be easy to see how to extend it to any length.

显然,您可以使 hash12 代码更紧凑一点 - 并且应该很容易看到如何将其扩展到任何长度。

Oh - and one last thing. If you have RC addressing enabled, using =CRC16("string")as a spreadsheet formula gives a hard-to-track #REFerror... which is why I renamed it hash4

哦,还有最后一件事。如果您启用了 RC 寻址,=CRC16("string")作为电子表格公式使用会产生难以跟踪的#REF错误......这就是我重命名它的原因hash4

回答by nixda

Maybe others will find this useful.

也许其他人会发现这很有用。

I have collected some different functions to generate a short hash of a string in VBA.
I don't take credit for the code and all sources are referenced.

我收集了一些不同的函数来在 VBA 中生成字符串的短散列。
我不相信代码,并且引用了所有来源。

enter image description here

在此处输入图片说明

  1. CRC16
    • Function: =CRC16HASH(A1)with this Code
    • hash is a 4 characters long HEX string
    • 19 code lines
    • 4 digits long hash = 624 collisions in 6895 lines = 9 % collision rate
  2. CRC16 numeric
    • Function: =CRC16NUMERIC(A1)with this Code
    • hash is a 5 digits long number
    • 92 code lines
    • 5 digits long hash = 616 collisions in 6895 lines = 8.9 % collision rate
  3. CRC16 twice
    • Function: =CRC16TWICE(A1)with this Code
    • hash is a 8 characters long HEX string
    • hash can be expanded to 12/16/20 etc. characters to reduce collision rate even more
    • 39 code lines
    • 8 digits long hash = 18 collisions in 6895 lines = 0.23 % collision rate
  4. SHA1
    • Function: =SHA1TRUNC(A1)with this Code
    • hash is a 40 characters long HEX string
    • 142 code lines
    • can be truncated
    • 4 digits hash = 726 collisions in 6895 lines = 10.5 % collision rate
    • 5 digits hash = 51 collisions in 6895 lines = 0.73 % collision rate
    • 6 digits hash = 0 collisions in 6895 lines = 0 % collision rate
  5. SHA1 + Base64
    • Function: =BASE64SHA1(A1)with this Code
    • hash is a 28 characters long unicode string (case sensitive + special chars)
    • 41 code lines
    • requires .NET since it uses library "Microsoft MSXML"
    • can be truncated
    • 4 digits hash = 36 collisions in 6895 lines = 0.5 % collision rate
    • 5 digits hash = 0 collisions in 6895 lines = 0 % collision rate
  1. CRC16
    • 功能:=CRC16HASH(A1)使用此代码
    • hash 是一个 4 个字符长的十六进制字符串
    • 19 行代码
    • 4 位长哈希 = 6895 行中的 624 次冲突 = 9 % 冲突率
  2. CRC16 数字
    • 功能:=CRC16NUMERIC(A1)使用此代码
    • hash 是一个 5 位长的数字
    • 92 行代码
    • 5 位长哈希 = 6895 行中的 616 次冲突 = 8.9 % 的冲突率
  3. CRC16 两次
    • 功能:=CRC16TWICE(A1)使用此代码
    • hash 是一个 8 个字符长的十六进制字符串
    • hash 可以扩展到 12/16/20 等字符以进一步降低碰撞率
    • 39 行代码
    • 8 位长哈希 = 6895 行中的 18 次冲突 = 0.23 % 冲突率
  4. SHA1
    • 功能:=SHA1TRUNC(A1)使用此代码
    • hash 是一个 40 个字符长的十六进制字符串
    • 142 行代码
    • 可以截断
    • 4 位哈希 = 6895 行中的 726 次冲突 = 10.5 % 冲突率
    • 5 位哈希 = 6895 行中的 51 次冲突 = 0.73 % 冲突率
    • 6 位哈希 = 6895 行中的 0 次冲突 = 0 % 冲突率
  5. SHA1 + Base64
    • 功能:=BASE64SHA1(A1)使用此代码
    • hash 是一个 28 个字符长的 unicode 字符串(区分大小写 + 特殊字符)
    • 41 行代码
    • 需要 .NET,因为它使用库“Microsoft MSXML”
    • 可以截断
    • 4 位哈希 = 6895 行中的 36 次冲突 = 0.5 % 冲突率
    • 5 位哈希 = 6895 行中的 0 次冲突 = 0 % 冲突率

Hereis my test workbook with all example functions and a big number of test strings.

是我的测试工作簿,其中包含所有示例函数和大量测试字符串。

Feel free to add own functions.

随意添加自己的功能。

回答by Florent B.

32 bits hash function for strings with a low level of collision:

低级别冲突字符串的 32 位哈希函数:

Public Function StrHash(text As String) As Long
    Dim i As Long
    StrHash = &H65D5BAAA

    For i = 1 To Len(text)
        StrHash = ((StrHash + AscW(Mid$(text, i, 1))) Mod 69208103) * 31&
    Next
End Function

Or as a 64 bits hash function:

或者作为 64 位哈希函数:

Public Function StrHash64(text As String) As String
    Dim i&, h1&, h2&, c&
    h1 = &H65D5BAAA
    h2 = &H2454A5ED

    For i = 1 To Len(text)
        c = AscW(Mid$(text, i, 1))
        h1 = ((h1 + c) Mod 69208103) * 31&
        h2 = ((h2 + c) Mod 65009701) * 33&
    Next

    StrHash64 = Right("00000000" & Hex(h1), 8) & Right("00000000" & Hex(h2), 8)
End Function

Based on the FNV hash algorithm

基于FNV哈希算法

回答by Assad Ebrahim

While the below is not a hash function, I've used it as a quick way to generate numeric id's that have a low collision rate over a small list (small enough to verify by inspection).

虽然下面的不是哈希函数,但我已将其用作生成数字 id 的快速方法,这些数字 id 在小列表(小到足以通过检查验证)上具有低冲突率。

How it Works: Column A holds the strings from row 2 onward. In row 1, A1 and B1 hold an arbitrary start and end position midway in the string. The formula uses the first letter of the string and a fixed letter taken from mid-string and uses LEN() as a 'fanning function' to reduce the chance of collisions.

工作原理:A 列包含第 2 行以后的字符串。在第 1 行中,A1 和 B1 在字符串中间保持任意的开始和结束位置。该公式使用字符串的第一个字母和取自字符串中间的固定字母,并使用 LEN() 作为“扇形函数”以减少冲突的机会。

 =CODE(A2)*LEN(A2) + CODE(MID(A2,$A,$B))*LEN(MID(A2,$A,$B))

If strings are pulled from a database table with fixed width fields, you may need to trim the lengths:

如果从具有固定宽度字段的数据库表中提取字符串,则可能需要修剪长度:

 =CODE(TRIM(C8))*LEN(TRIM(C8))
       +CODE(MID(TRIM(C8),$A,1))*LEN(MID(TRIM(C8),$A,$B))