VBA 中的加权 Damerau-Levenshtein

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/13693149/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-11 18:41:34  来源:igfitidea点击:

Weighted Damerau-Levenshtein in VBA

excelexcel-vbalevenshtein-distancevba

提问by rgmatthes

I'm building a private spellchecker for the Microsoft Office suite. I'm doing string comparisons of typos and their potential fixes to determine which corrections I want included.

我正在为 Microsoft Office 套件构建一个私人拼写检查器。我正在对拼写错误及其潜在修复进行字符串比较,以确定我想要包含哪些更正。

I've looked high and low for a weightedDamerau-Levenshtein formula for string comparison because I want swaps, insertions, deletions and replacements to all have different weights, not simply a weight of "1", so I can give preference to some corrections over others. For example, the typo "agmes" could theoretically correct to "games" or"ages", since both require just one edit to move to either correctly spelled word, but I'd like to give the "swap" edit a lower weight so that "games" would show as the preferred correction.

对于用于字符串比较的加权Damerau-Levenshtein 公式,我已经看高了和低了,因为我希望交换、插入、删除和替换都具有不同的权重,而不仅仅是“1”的权重,因此我可以优先进行一些更正超过其他人。例如,拼写错误“agmes”理论上可以更正为“游戏”“年龄”,因为两者都只需要一次编辑即可移动到拼写正确的单词,但我想给“交换”编辑一个较低的权重“游戏”将显示为首选更正。

I'm using Excel for analysis, so any code I use needs to be in Visual Basic for Applications (VBA). The best I could find is this example, which seems great, but it's in Java. I tried my best to convert, but I'm far from an expert and could use a little help!

我使用 Excel 进行分析,因此我使用的任何代码都需要在 Visual Basic for Applications (VBA) 中。我能找到的最好的是这个 example,它看起来很棒,但它是用 Java 编写的。我已尽力转换,但我远非专家,需要一点帮助!

Can anyone take a look at the attached code and help me figure out what's wrong?

任何人都可以看看附加的代码并帮助我找出问题所在吗?

THANK YOU!

谢谢你!

EDIT: I got it working on my own. Here's a weighted Damerau-Levenshtein formula in VBA. It uses Excel's built-in math functions for some evaluation. When comparing a typo to two possible corrections, the correction with the highestcost is the preferred word. This is because the cost of two swaps must be greater than the cost of a deletion and an insertion, and that's not possible if you assign swaps with the lowest cost (which I think is ideal). Check out Kevin's blog if you need more info.

编辑:我让它自己工作。这是 VBA 中的加权 Damerau-Levenshtein 公式。它使用 Excel 的内置数学函数进行一些评估。当将一个错字与两个可能的更正进行比较时,成本最高的更正是首选词。这是因为两次交换的成本必须大于删除和插入的成本,如果您分配成本最低的交换(我认为这是理想的),这是不可能的。如果您需要更多信息,请查看 Kevin 的博客。

Public Function WeightedDL(source As String, target As String) As Double

    Dim deleteCost As Double
    Dim insertCost As Double
    Dim replaceCost As Double
    Dim swapCost As Double

    deleteCost = 1
    insertCost = 1.1
    replaceCost = 1.1
    swapCost = 1.2

    Dim i As Integer
    Dim j As Integer
    Dim k As Integer

    If Len(source) = 0 Then
        WeightedDL = Len(target) * insertCost
        Exit Function
    End If

    If Len(target) = 0 Then
        WeightedDL = Len(source) * deleteCost
        Exit Function
    End If

    Dim table() As Double
    ReDim table(Len(source), Len(target))

    Dim sourceIndexByCharacter() As Variant
    ReDim sourceIndexByCharacter(0 To 1, 0 To Len(source) - 1) As Variant

    If Left(source, 1) <> Left(target, 1) Then
        table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost))
    End If

    sourceIndexByCharacter(0, 0) = Left(source, 1)
    sourceIndexByCharacter(1, 0) = 0

    Dim deleteDistance As Double
    Dim insertDistance As Double
    Dim matchDistance As Double

    For i = 1 To Len(source) - 1

        deleteDistance = table(i - 1, 0) + deleteCost
        insertDistance = ((i + 1) * deleteCost) + insertCost

        If Mid(source, i + 1, 1) = Left(target, 1) Then
            matchDistance = (i * deleteCost) + 0
        Else
            matchDistance = (i * deleteCost) + replaceCost
        End If

        table(i, 0) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance)
    Next

    For j = 1 To Len(target) - 1

        deleteDistance = table(0, j - 1) + insertCost
        insertDistance = ((j + 1) * insertCost) + deleteCost

        If Left(source, 1) = Mid(target, j + 1, 1) Then
            matchDistance = (j * insertCost) + 0
        Else
            matchDistance = (j * insertCost) + replaceCost
        End If

        table(0, j) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance)
    Next

    For i = 1 To Len(source) - 1

        Dim maxSourceLetterMatchIndex As Integer

        If Mid(source, i + 1, 1) = Left(target, 1) Then
            maxSourceLetterMatchIndex = 0
        Else
            maxSourceLetterMatchIndex = -1
        End If

        For j = 1 To Len(target) - 1

            Dim candidateSwapIndex As Integer
            candidateSwapIndex = -1

            For k = 0 To UBound(sourceIndexByCharacter, 2)
                If sourceIndexByCharacter(0, k) = Mid(target, j + 1, 1) Then candidateSwapIndex = sourceIndexByCharacter(1, k)
            Next

            Dim jSwap As Integer
            jSwap = maxSourceLetterMatchIndex

            deleteDistance = table(i - 1, j) + deleteCost
            insertDistance = table(i, j - 1) + insertCost
            matchDistance = table(i - 1, j - 1)

            If Mid(source, i + 1, 1) <> Mid(target, j + 1, 1) Then
                matchDistance = matchDistance + replaceCost
            Else
                maxSourceLetterMatchIndex = j
            End If

            Dim swapDistance As Double

            If candidateSwapIndex <> -1 And jSwap <> -1 Then

                Dim iSwap As Integer
                iSwap = candidateSwapIndex

                Dim preSwapCost
                If iSwap = 0 And jSwap = 0 Then
                    preSwapCost = 0
                Else
                    preSwapCost = table(Application.Max(0, iSwap - 1), Application.Max(0, jSwap - 1))
                End If

                swapDistance = preSwapCost + ((i - iSwap - 1) * deleteCost) + ((j - jSwap - 1) * insertCost) + swapCost

            Else
                swapDistance = 500
            End If

            table(i, j) = Application.Min(Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance), swapDistance)

        Next

        sourceIndexByCharacter(0, i) = Mid(source, i + 1, 1)
        sourceIndexByCharacter(1, i) = i

    Next

    WeightedDL = table(Len(source) - 1, Len(target) - 1)

End Function

回答by Nigel Heffernan

I can see you've answered this yourself: I wrote a modified Levenshtein edit distance algorithm for address matching a couple of years ago (the site's now hosted in Russia and it's a bad idea to go there) but that didn't perform at all well, and a 'sum of common strings' approach was adequate for the task in hand:

我可以看到你自己已经回答了这个问题:几年前我写了一个修改过的 Levenshtein 编辑距离算法来进行地址匹配(该网站现在在俄罗斯托管,去那里是个坏主意)但根本没有执行好吧,“通用字符串的总和”方法足以完成手头的任务:

Fuzzy-Matching strings in Excel using a simplified 'Edit Distance' proxy in VBA

使用 VBA 中简化的“编辑距离”代理在 Excel 中模糊匹配字符串

That code probably needs re-testing and re-work.

该代码可能需要重新测试和重新工作。

Looking at your code, if you ever want to revisit it, here's a speed tip:

查看您的代码,如果您想重新访问它,这里有一个速度提示

Dim arrByte() As Byte 
Dim byteChar As Byte 

arrByte = strSource

for i = LBound(arrByte) To UBound(arrByte) Step 2 
    byteChar = arrByte(i)     ' I'll do some comparison operations using integer arithmetic on the char
Next i 

String-handling in VBA is horribly slow, even if you use Mid$() instead of Mid(), but numeric operations are pretty good: and strings are actually arrays of bytes, which the compiler will accept at face value.

VBA 中的字符串处理速度非常慢,即使您使用 Mid$() 而不是 Mid(),但数字运算非常好:并且字符串实际上是字节数组,编译器将接受它的表面价值。

The 'step' of 2 in the loop is to skip over the high-order bytes in unicode strings - you're probablyrunning your string comparison on plain-vanilla ASCII text, and you'll see that the byte array for (say) "ABCd" is (00, 65, 00, 66, 00, 67, 00, 100). Most of the Latin alphabet in Western European countries - accents, diacritics, dipthongs and all - will fit in under 255 and won't venture into the higer-order bytes that show as zeroes in that wxample.

循环中 2 的“步骤”是跳过 unicode 字符串中的高位字节 - 您可能正在对普通 ASCII 文本运行字符串比较,并且您会看到(例如)的字节数组"ABCd" 是 (00, 65, 00, 66, 00, 67, 00, 100)。西欧国家的大部分拉丁字母——重音符号、变音符号、双元音等等——都适合 255 以下,并且不会冒险进入在该 wxample 中显示为零的高阶字节。

You'll get away with it in strictly monolingualstring comparisons in Hebrew, Greek, Russian and Arabic because the upper byte is constant within each alphabet: Greek "αβγδ" is the byte array (177,3,178,3,179,3,180,3). However, that's sloppy coding and it'll bite (or byte) you the moment you try string comparisons across languages. And it's never going to fly in Eastern alphabets.

在希伯来语、希腊语、俄语和阿拉伯语的严格单语字符串比较中,您将摆脱它,因为每个字母表中的高位字节是恒定的:希腊语“αβγδ”是字节数组 (177,3,178,3,179,3,180,3)。但是,这是草率的编码,当您尝试跨语言进行字符串比较时,它会咬(或字节)您。它永远不会在东方字母表中飞行。

回答by Big Wave

Believe these lines are wrong:-

相信这些行是错误的:-

deleteDistance = table(0, j - 1) + insertCost
insertDistance = ((j + 1) * insertCost) + deleteCost

Think should be: -

认为应该是:-

deleteDistance = ((j + 1) * insertCost) + deleteCost
insertDistance = table(0, j - 1) + insertCost

Haven't gone through the code to work out what is occurring however the below is odd!!!

还没有通过代码来弄清楚发生了什么,但是下面很奇怪!!!

If Left(source, 1) <> Left(target, 1) Then
    table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost))
End If

As you will need to replace, delete, or insert it possibly should be:-

由于您需要替换、删除或插入它可能应该是:-

If Left(source, 1) <> Left(target, 1) Then
    table(0, 0) = Application.Min(replaceCost, Application.Min(deleteCost, insertCost))
End If