Levenshtein Distance Algorithm better than O(n*m)?

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/4057513/

Time: 2020-08-30 17:58:59  Source: igfitidea

Tags: ios, algorithm, big-o, levenshtein-distance

Asked by Jason

I have been looking for an advanced Levenshtein distance algorithm, and the best I have found so far is O(n*m), where n and m are the lengths of the two strings. The algorithm is at this scale because of space, not time: it creates a matrix of the two strings, such as this one:

[Image not preserved: the matrix built from the two strings]

Is there a publicly available Levenshtein algorithm which is better than O(n*m)? I am not averse to looking at advanced computer science papers and research, but I haven't been able to find anything. I have found one company, Exorbyte, which supposedly has built a super-advanced and super-fast Levenshtein algorithm, but of course that is a trade secret. I am building an iPhone app in which I would like to use the Levenshtein distance calculation. There is an Objective-C implementation available, but with the limited amount of memory on iPods and iPhones, I'd like to find a better algorithm if possible.

Answered by srean

Are you interested in reducing the time complexity or the space complexity? The average time complexity can be reduced to O(n + d^2), where n is the length of the longer string and d is the edit distance. If you are only interested in the edit distance and not in reconstructing the edit sequence, you only need to keep the last two rows of the matrix in memory, so the space will be O(n).
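
To get a feel for the memory difference, here is a rough worked example. The string length (1000 characters each) and the 4-byte cell size are assumptions chosen for illustration, not figures from the question.

```latex
\begin{align*}
\text{full matrix:} \quad & (1000+1)\times(1000+1)\times 4\ \text{bytes} = 4{,}008{,}004\ \text{bytes} \approx 4\ \text{MB} \\
\text{two rows:} \quad & 2\times(1000+1)\times 4\ \text{bytes} = 8{,}008\ \text{bytes} \approx 8\ \text{KB}
\end{align*}
```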

If you can afford to approximate, there are poly-logarithmic approximations.

For the O(n + d^2) algorithm, look for Ukkonen's optimization or its enhancement, Enhanced Ukkonen. The best approximation that I know of is the one by Andoni, Krauthgamer, and Onak.

Answered by Nick Johnson

If you only want the threshold function - e.g., to test if the distance is under a certain threshold - you can reduce the time and space complexity by only calculating the n values either side of the main diagonal in the array. You can also use Levenshtein Automata to evaluate many words against a single base word in O(n) time - and the construction of the automata can be done in O(m) time, too.
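
The threshold idea can be sketched in C roughly as follows. This is a minimal illustration, not code from the answer: the function name `levenshtein_within`, the parameter `k` (the threshold), and the return convention (1 = within threshold, 0 = not, -1 = allocation failure) are all assumptions made for this example. Only cells at most `k` columns away from the main diagonal are computed; everything outside that band is treated as "over the threshold".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the Levenshtein distance between s and t
   is <= k, 0 if it is larger, and -1 if memory allocation fails. */
int levenshtein_within(const char *s, const char *t, size_t k)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);
    size_t longest = n > m ? n : m;
    size_t diff = n > m ? n - m : m - n;

    if (diff > k) return 0;         /* length gap alone exceeds the threshold */
    if (k >= longest) return 1;     /* the distance never exceeds the longer length */

    const size_t over = k + 1;      /* stands for "more than k" */
    size_t *prev = malloc((m + 1) * sizeof *prev);
    size_t *curr = malloc((m + 1) * sizeof *curr);
    if (!prev || !curr) { free(prev); free(curr); return -1; }

    /* Row 0: distance from the empty prefix of s to each prefix of t. */
    for (size_t j = 0; j <= m; j++) prev[j] = (j <= k) ? j : over;

    for (size_t i = 1; i <= n; i++) {
        size_t lo = (i > k) ? i - k : 1;        /* first column inside the band */
        size_t hi = (i + k < m) ? i + k : m;    /* last column inside the band */

        curr[0] = (i <= k) ? i : over;
        if (lo > 1) curr[lo - 1] = over;        /* left neighbour outside the band */

        for (size_t j = lo; j <= hi; j++) {
            size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            size_t best = prev[j - 1] + cost;                  /* match or substitution */
            if (prev[j] + 1 < best) best = prev[j] + 1;        /* deletion */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1; /* insertion */
            curr[j] = (best > over) ? over : best;             /* clamp to avoid growth */
        }
        if (hi < m) curr[hi + 1] = over;        /* right neighbour outside the band */

        size_t *tmp = prev; prev = curr; curr = tmp;
    }

    int within = prev[m] <= k;
    free(prev);
    free(curr);
    return within;
}
```

For example, `levenshtein_within("kitten", "sitting", 3)` should report that the two words are within the threshold, while the same call with a threshold of 2 should not. The sketch still allocates two full rows for simplicity; a tighter version could allocate only the band itself.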

Answered by Dani

Look at Wikipedia - it has some ideas for improving the space complexity of this algorithm:

Wiki-Link: Levenshtein distance

Quoting:

We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
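
The two-row idea in the quote can be sketched in C along these lines. This is an illustrative version written for this page, not the Wikipedia code; the function name `levenshtein_two_rows` and the use of `(size_t)-1` to signal an allocation failure are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: Levenshtein distance between s and t, keeping only the
   previous and current rows of the matrix. Returns (size_t)-1 if malloc fails. */
size_t levenshtein_two_rows(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);
    size_t *prev = malloc((m + 1) * sizeof *prev);
    size_t *curr = malloc((m + 1) * sizeof *curr);
    if (!prev || !curr) { free(prev); free(curr); return (size_t)-1; }

    /* Row 0: turning the empty string into t[0..j) takes j insertions. */
    for (size_t j = 0; j <= m; j++) prev[j] = j;

    for (size_t i = 1; i <= n; i++) {
        curr[0] = i;    /* turning s[0..i) into the empty string takes i deletions */
        for (size_t j = 1; j <= m; j++) {
            size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            size_t best = prev[j - 1] + cost;                  /* match or substitution */
            if (prev[j] + 1 < best) best = prev[j] + 1;        /* deletion */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1; /* insertion */
            curr[j] = best;
        }
        /* The current row becomes the previous row for the next iteration. */
        size_t *tmp = prev; prev = curr; curr = tmp;
    }

    size_t distance = prev[m];
    free(prev);
    free(curr);
    return distance;
}
```

The running time is still proportional to n*m; only the memory drops, to two rows of m + 1 cells. The trade-off is that the edit sequence can no longer be reconstructed, because the earlier rows are overwritten.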

Answered by nponeccop

I found another optimization that claims to be O(max(m, n)):

http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#C

(the second C implementation)
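
For readers who cannot follow the link, a single-array variant can be sketched as below. This is not a copy of the Wikibooks code; it is an independent illustration that assumes the linked version works in a similar spirit (one array plus a saved diagonal value). The name `levenshtein_one_row` and the `(size_t)-1` failure value are assumptions. Linear bounds of this kind describe memory use; the running time remains proportional to n*m.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: Levenshtein distance using a single row plus one saved
   diagonal value. Returns (size_t)-1 if malloc fails. */
size_t levenshtein_one_row(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);

    /* Index the row by the shorter string so the array is as small as possible. */
    if (m > n) {
        const char *tmp_str = s; s = t; t = tmp_str;
        size_t tmp_len = n; n = m; m = tmp_len;
    }

    size_t *row = malloc((m + 1) * sizeof *row);
    if (!row) return (size_t)-1;

    for (size_t j = 0; j <= m; j++) row[j] = j;     /* row 0 of the matrix */

    for (size_t i = 1; i <= n; i++) {
        size_t diag = row[0];   /* value of cell (i-1, 0) before it is overwritten */
        row[0] = i;
        for (size_t j = 1; j <= m; j++) {
            size_t up = row[j];                                 /* cell (i-1, j) */
            size_t best = diag + ((s[i - 1] == t[j - 1]) ? 0 : 1); /* match or substitution */
            if (up + 1 < best) best = up + 1;                   /* deletion */
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1;   /* insertion */
            row[j] = best;
            diag = up;          /* becomes cell (i-1, j) for the next column's diagonal */
        }
    }

    size_t distance = row[m];
    free(row);
    return distance;
}
```

Because the distance is symmetric, swapping the strings does not change the result, and the array then needs only one cell per character of the shorter string plus one.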
