VBA 中的加权 Damerau-Levenshtein
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/13693149/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Weighted Damerau-Levenshtein in VBA
提问by rgmatthes
I'm building a private spellchecker for the Microsoft Office suite. I'm doing string comparisons of typos and their potential fixes to determine which corrections I want included.
我正在为 Microsoft Office 套件构建一个私人拼写检查器。我正在对拼写错误及其潜在修复进行字符串比较,以确定我想要包含哪些更正。
I've looked high and low for a weightedDamerau-Levenshtein formula for string comparison because I want swaps, insertions, deletions and replacements to all have different weights, not simply a weight of "1", so I can give preference to some corrections over others. For example, the typo "agmes" could theoretically correct to "games" or"ages", since both require just one edit to move to either correctly spelled word, but I'd like to give the "swap" edit a lower weight so that "games" would show as the preferred correction.
对于用于字符串比较的加权Damerau-Levenshtein 公式,我已经看高了和低了,因为我希望交换、插入、删除和替换都具有不同的权重,而不仅仅是“1”的权重,因此我可以优先进行一些更正超过其他人。例如,拼写错误“agmes”理论上可以更正为“游戏”或“年龄”,因为两者都只需要一次编辑即可移动到拼写正确的单词,但我想给“交换”编辑一个较低的权重“游戏”将显示为首选更正。
I'm using Excel for analysis, so any code I use needs to be in Visual Basic for Applications (VBA). The best I could find is this example, which seems great, but it's in Java. I tried my best to convert, but I'm far from an expert and could use a little help!
我使用 Excel 进行分析,因此我使用的任何代码都需要在 Visual Basic for Applications (VBA) 中。我能找到的最好的是这个 example,它看起来很棒,但它是用 Java 编写的。我已尽力转换,但我远非专家,需要一点帮助!
Can anyone take a look at the attached code and help me figure out what's wrong?
任何人都可以看看附加的代码并帮助我找出问题所在吗?
THANK YOU!
谢谢你!
EDIT: I got it working on my own. Here's a weighted Damerau-Levenshtein formula in VBA. It uses Excel's built-in math functions for some evaluation. When comparing a typo to two possible corrections, the correction with the highestcost is the preferred word. This is because the cost of two swaps must be greater than the cost of a deletion and an insertion, and that's not possible if you assign swaps with the lowest cost (which I think is ideal). Check out Kevin's blog if you need more info.
编辑:我让它自己工作。这是 VBA 中的加权 Damerau-Levenshtein 公式。它使用 Excel 的内置数学函数进行一些评估。当将一个错字与两个可能的更正进行比较时,成本最高的更正是首选词。这是因为两次交换的成本必须大于删除和插入的成本,如果您分配成本最低的交换(我认为这是理想的),这是不可能的。如果您需要更多信息,请查看 Kevin 的博客。
Public Function WeightedDL(source As String, target As String) As Double
Dim deleteCost As Double
Dim insertCost As Double
Dim replaceCost As Double
Dim swapCost As Double
deleteCost = 1
insertCost = 1.1
replaceCost = 1.1
swapCost = 1.2
Dim i As Integer
Dim j As Integer
Dim k As Integer
If Len(source) = 0 Then
WeightedDL = Len(target) * insertCost
Exit Function
End If
If Len(target) = 0 Then
WeightedDL = Len(source) * deleteCost
Exit Function
End If
Dim table() As Double
ReDim table(Len(source), Len(target))
Dim sourceIndexByCharacter() As Variant
ReDim sourceIndexByCharacter(0 To 1, 0 To Len(source) - 1) As Variant
If Left(source, 1) <> Left(target, 1) Then
table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost))
End If
sourceIndexByCharacter(0, 0) = Left(source, 1)
sourceIndexByCharacter(1, 0) = 0
Dim deleteDistance As Double
Dim insertDistance As Double
Dim matchDistance As Double
For i = 1 To Len(source) - 1
deleteDistance = table(i - 1, 0) + deleteCost
insertDistance = ((i + 1) * deleteCost) + insertCost
If Mid(source, i + 1, 1) = Left(target, 1) Then
matchDistance = (i * deleteCost) + 0
Else
matchDistance = (i * deleteCost) + replaceCost
End If
table(i, 0) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance)
Next
For j = 1 To Len(target) - 1
deleteDistance = table(0, j - 1) + insertCost
insertDistance = ((j + 1) * insertCost) + deleteCost
If Left(source, 1) = Mid(target, j + 1, 1) Then
matchDistance = (j * insertCost) + 0
Else
matchDistance = (j * insertCost) + replaceCost
End If
table(0, j) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance)
Next
For i = 1 To Len(source) - 1
Dim maxSourceLetterMatchIndex As Integer
If Mid(source, i + 1, 1) = Left(target, 1) Then
maxSourceLetterMatchIndex = 0
Else
maxSourceLetterMatchIndex = -1
End If
For j = 1 To Len(target) - 1
Dim candidateSwapIndex As Integer
candidateSwapIndex = -1
For k = 0 To UBound(sourceIndexByCharacter, 2)
If sourceIndexByCharacter(0, k) = Mid(target, j + 1, 1) Then candidateSwapIndex = sourceIndexByCharacter(1, k)
Next
Dim jSwap As Integer
jSwap = maxSourceLetterMatchIndex
deleteDistance = table(i - 1, j) + deleteCost
insertDistance = table(i, j - 1) + insertCost
matchDistance = table(i - 1, j - 1)
If Mid(source, i + 1, 1) <> Mid(target, j + 1, 1) Then
matchDistance = matchDistance + replaceCost
Else
maxSourceLetterMatchIndex = j
End If
Dim swapDistance As Double
If candidateSwapIndex <> -1 And jSwap <> -1 Then
Dim iSwap As Integer
iSwap = candidateSwapIndex
Dim preSwapCost
If iSwap = 0 And jSwap = 0 Then
preSwapCost = 0
Else
preSwapCost = table(Application.Max(0, iSwap - 1), Application.Max(0, jSwap - 1))
End If
swapDistance = preSwapCost + ((i - iSwap - 1) * deleteCost) + ((j - jSwap - 1) * insertCost) + swapCost
Else
swapDistance = 500
End If
table(i, j) = Application.Min(Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance), swapDistance)
Next
sourceIndexByCharacter(0, i) = Mid(source, i + 1, 1)
sourceIndexByCharacter(1, i) = i
Next
WeightedDL = table(Len(source) - 1, Len(target) - 1)
End Function
回答by Nigel Heffernan
I can see you've answered this yourself: I wrote a modified Levenshtein edit distance algorithm for address matching a couple of years ago (the site's now hosted in Russia and it's a bad idea to go there) but that didn't perform at all well, and a 'sum of common strings' approach was adequate for the task in hand:
我可以看到你自己已经回答了这个问题:几年前我写了一个修改过的 Levenshtein 编辑距离算法来进行地址匹配(该网站现在在俄罗斯托管,去那里是个坏主意)但根本没有执行好吧,“通用字符串的总和”方法足以完成手头的任务:
Fuzzy-Matching strings in Excel using a simplified 'Edit Distance' proxy in VBA
使用 VBA 中简化的“编辑距离”代理在 Excel 中模糊匹配字符串
That code probably needs re-testing and re-work.
该代码可能需要重新测试和重新工作。
Looking at your code, if you ever want to revisit it, here's a speed tip:
查看您的代码,如果您想重新访问它,这里有一个速度提示:
Dim arrByte() As Byte Dim byteChar As Byte arrByte = strSource for i = LBound(arrByte) To UBound(arrByte) Step 2 byteChar = arrByte(i) ' I'll do some comparison operations using integer arithmetic on the char Next i
String-handling in VBA is horribly slow, even if you use Mid$() instead of Mid(), but numeric operations are pretty good: and strings are actually arrays of bytes, which the compiler will accept at face value.
VBA 中的字符串处理速度非常慢,即使您使用 Mid$() 而不是 Mid(),但数字运算非常好:并且字符串实际上是字节数组,编译器将接受它的表面价值。
The 'step' of 2 in the loop is to skip over the high-order bytes in unicode strings - you're probablyrunning your string comparison on plain-vanilla ASCII text, and you'll see that the byte array for (say) "ABCd" is (00, 65, 00, 66, 00, 67, 00, 100). Most of the Latin alphabet in Western European countries - accents, diacritics, dipthongs and all - will fit in under 255 and won't venture into the higer-order bytes that show as zeroes in that wxample.
循环中 2 的“步骤”是跳过 unicode 字符串中的高位字节 - 您可能正在对普通 ASCII 文本运行字符串比较,并且您会看到(例如)的字节数组"ABCd" 是 (00, 65, 00, 66, 00, 67, 00, 100)。西欧国家的大部分拉丁字母——重音符号、变音符号、双元音等等——都适合 255 以下,并且不会冒险进入在该 wxample 中显示为零的高阶字节。
You'll get away with it in strictly monolingualstring comparisons in Hebrew, Greek, Russian and Arabic because the upper byte is constant within each alphabet: Greek "αβγδ" is the byte array (177,3,178,3,179,3,180,3). However, that's sloppy coding and it'll bite (or byte) you the moment you try string comparisons across languages. And it's never going to fly in Eastern alphabets.
在希伯来语、希腊语、俄语和阿拉伯语的严格单语字符串比较中,您将摆脱它,因为每个字母表中的高位字节是恒定的:希腊语“αβγδ”是字节数组 (177,3,178,3,179,3,180,3)。但是,这是草率的编码,当您尝试跨语言进行字符串比较时,它会咬(或字节)您。它永远不会在东方字母表中飞行。
回答by Big Wave
Believe these lines are wrong:-
相信这些行是错误的:-
deleteDistance = table(0, j - 1) + insertCost
insertDistance = ((j + 1) * insertCost) + deleteCost
Think should be: -
认为应该是:-
deleteDistance = ((j + 1) * insertCost) + deleteCost
insertDistance = table(0, j - 1) + insertCost
Haven't gone through the code to work out what is occurring however the below is odd!!!
还没有通过代码来弄清楚发生了什么,但是下面很奇怪!!!
If Left(source, 1) <> Left(target, 1) Then
table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost))
End If
As you will need to replace, delete, or insert it possibly should be:-
由于您需要替换、删除或插入它可能应该是:-
If Left(source, 1) <> Left(target, 1) Then
table(0, 0) = Application.Min(replaceCost, Application.Min(deleteCost, insertCost))
End If