java 优化 Jaro-Winkler 算法
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2848807/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Optimizing Jaro-Winkler algorithm
提问by Pentium10
I have this code for Jaro-Winkler algorithm taken from thiswebsite. I need to run 150,000 times to get distance between differences. It takes a long time, as I run on an Android mobile device.
我有从这个网站上获取的 Jaro-Winkler 算法的代码。我需要运行 150,000 次才能获得差异之间的距离。这需要很长时间,因为我在 Android 移动设备上运行。
Can it be optimized more?
还能再优化吗?
public class Jaro {
/**
* gets the similarity of the two strings using Jaro distance.
*
* @param string1 the first input string
* @param string2 the second input string
* @return a value between 0-1 of the similarity
*/
public float getSimilarity(final String string1, final String string2) {
//get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
final int halflen = ((Math.min(string1.length(), string2.length())) / 2) + ((Math.min(string1.length(), string2.length())) % 2);
//get common characters
final StringBuffer common1 = getCommonCharacters(string1, string2, halflen);
final StringBuffer common2 = getCommonCharacters(string2, string1, halflen);
//check for zero in common
if (common1.length() == 0 || common2.length() == 0) {
return 0.0f;
}
//check for same length common strings returning 0.0f is not the same
if (common1.length() != common2.length()) {
return 0.0f;
}
//get the number of transpositions
int transpositions = 0;
int n=common1.length();
for (int i = 0; i < n; i++) {
if (common1.charAt(i) != common2.charAt(i))
transpositions++;
}
transpositions /= 2.0f;
//calculate jaro metric
return (common1.length() / ((float) string1.length()) +
common2.length() / ((float) string2.length()) +
(common1.length() - transpositions) / ((float) common1.length())) / 3.0f;
}
/**
* returns a string buffer of characters from string1 within string2 if they are of a given
* distance seperation from the position in string1.
*
* @param string1
* @param string2
* @param distanceSep
* @return a string buffer of characters from string1 within string2 if they are of a given
* distance seperation from the position in string1
*/
private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) {
//create a return buffer of characters
final StringBuffer returnCommons = new StringBuffer();
//create a copy of string2 for processing
final StringBuffer copy = new StringBuffer(string2);
//iterate over string1
int n=string1.length();
int m=string2.length();
for (int i = 0; i < n; i++) {
final char ch = string1.charAt(i);
//set boolean for quick loop exit if found
boolean foundIt = false;
//compare char with range of characters to either side
for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) {
//check if found
if (copy.charAt(j) == ch) {
foundIt = true;
//append character found
returnCommons.append(ch);
//alter copied string2 for processing
copy.setCharAt(j, (char)0);
}
}
}
return returnCommons;
}
}
I mention that in the whole process I make just instance of the script, so only once
我提到在整个过程中我只制作了脚本的实例,所以只有一次
jaro= new Jaro();
If you are going to test and need examples so not break the script, you will find it here, in another thread for python optimization
如果您要测试并需要示例以免破坏脚本,您可以在此处找到它,在另一个用于 Python 优化的线程中
采纳答案by bmargulies
Yes, but you aren't going to enjoy it. Replace all those newed StringBuffers with char arrays that are allocated in the constructor and never again, using integer indices to keep track of what's in them.
是的,但你不会享受它。用new在构造函数中分配的 char 数组替换所有那些ed StringBuffers,并且永远不会再次使用整数索引来跟踪它们中的内容。
This pending Commons-Lang patchwill give you some of the flavor.
这个待定的 Commons-Lang 补丁会给你一些味道。
回答by mvantol1
I know this question has probably been solved for some time, but I would like to comment on the algorithm itself. When comparing a string against itself, the answer turns out to be 1/|string| off. When comparing slightly different values, the values also turn out to be lower.
我知道这个问题可能已经解决了一段时间,但我想评论算法本身。将字符串与其自身进行比较时,结果是 1/|string| 离开。当比较略有不同的值时,这些值也变得更低。
The solution to this is to adjust 'm-1' to 'm' in the inner for-statement within the getCommonCharacters method. The code then works like a charm :)
对此的解决方案是在 getCommonCharacters 方法的内部 for 语句中将 'm-1' 调整为 'm'。然后代码就像一个魅力:)
See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distanceas well for some examples.
有关一些示例,请参见http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance。
回答by Ivan
I don't know much about Android and how it works with databases. WP7 has (will have :) ) SQL CE. The next step would typically be to work with your data. Add string lengths and limit your comparisons. Add indexes on both columns and sort by length and then by value. The index on length should be sorted as well. I had it run on an old server with 150 000 medical terms giving me suggestions and spell checking in under 0.5 seconds, users could barely notice it, especially if running on a separate thread.
我不太了解 Android 以及它如何与数据库配合使用。WP7 有(将有 :))SQL CE。下一步通常是处理您的数据。添加字符串长度并限制比较。在两列上添加索引并按长度排序,然后按值排序。长度索引也应该排序。我让它在一台旧服务器上运行,它有 150 000 个医学术语,在 0.5 秒内为我提供建议和拼写检查,用户几乎不会注意到它,尤其是在单独的线程上运行时。
I meant to blog about it for a long time (like 2 years :) ) because there is a need. But I finally manage to write few words about it and provide some tips. Please check it out here:
我打算在博客上写很长一段时间(比如 2 年 :)),因为有需要。但我终于设法写了几句话并提供了一些提示。请在这里查看:
Although it is for Microsoft platform, still general principles are the same.
虽然是针对微软平台,但大体原理还是一样的。
回答by larsga
Yes, this can be made a lot faster. For one thing, you don't need the StringBuffers at all. For another, you don't need a separate loop to count transpositions.
是的,这可以做得更快。一方面,您根本不需要 StringBuffers。另一方面,您不需要单独的循环来计算换位。
You can find my implementation here, and it should be a lot faster. It's under Apache 2.0 License.
回答by dvidben
Instead returning the common characters using GetCommonCharacters method, use a couple of arrays to keep the matches, similarly to the C version here https://github.com/miguelvps/c/blob/master/jarowinkler.c
而不是使用 GetCommonCharacters 方法返回公共字符,而是使用几个数组来保持匹配,类似于这里的 C 版本https://github.com/miguelvps/c/blob/master/jarowinkler.c
/*Calculate matching characters*/
for (i = 0; i < al; i++) {
for (j = max(i - range, 0), l = min(i + range + 1, sl); j < l; j++) {
if (a[i] == s[j] && !sflags[j]) {
sflags[j] = 1;
aflags[i] = 1;
m++;
break;
}
}
}
Another optimization is to pre-calculate a bitmask for each string. Using that, check if the current character on the first string is present on the second. This can be done using efficient bitwise operations.
另一个优化是预先计算每个字符串的位掩码。使用它,检查第一个字符串上的当前字符是否存在于第二个字符串中。这可以使用高效的按位运算来完成。
This will skip calculating the max/min and looping for missing characters.
这将跳过计算最大/最小和循环丢失字符。
回答by Rubys
- Try to avoid the two nested loops in the getCommonCharacters loop.
Suggestion as to how: store all the chars in the smaller string in a map of some sort(java has a few), where the key is the character and the value is the position, that way you can still calculate the distance, wether they are in common. I don't quite understand the algorithm, but I think this is doable. - Except for that and bmargulies's answer, I really don't see further optimizations beyond stuff like bits etc. If this is really critical, consider rewriting this portion in C?
- 尽量避免 getCommonCharacters 循环中的两个嵌套循环。
关于如何的建议:将较小字符串中的所有字符存储在某种映射中(java 有一些),其中键是字符,值是位置,这样您仍然可以计算距离,无论它们是共同的。我不太了解算法,但我认为这是可行的。 - 除了那个和 bmargulies 的回答之外,我真的没有看到除位等内容之外的进一步优化。如果这真的很重要,请考虑用 C 重写这部分?

