Java中的相似性字符串比较
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/955110/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Similarity String Comparison in Java
提问by Mario Ortegón
I want to compare several strings to each other, and find the ones that are the most similar. I was wondering if there is any library, method or best practice that would return me which strings are more similar to other strings. For example:
我想将几个字符串相互比较,并找到最相似的字符串。我想知道是否有任何库、方法或最佳实践可以让我返回哪些字符串与其他字符串更相似。例如:
- "The quick fox jumped" -> "The fox jumped"
- "The quick fox jumped" -> "The fox"
- “敏捷的狐狸跳了”->“狐狸跳了”
- “敏捷的狐狸跳了起来”->“狐狸”
This comparison would return that the first is more similar than the second.
这种比较将返回第一个比第二个更相似。
I guess I need some method such as:
我想我需要一些方法,例如:
double similarityIndex(String s1, String s2)
Is there such a thing somewhere?
有没有这样的地方?
EDIT: Why am I doing this? I am writing a script that compares the output of a MS Project file to the output of some legacy system that handles tasks. Because the legacy system has a very limited field width, when the values are added the descriptions are abbreviated. I want some semi-automated way to find which entries from MS Project are similar to the entries on the system so I can get the generated keys. It has drawbacks, as it has to be still manually checked, but it would save a lot of work
编辑:我为什么要这样做?我正在编写一个脚本,将 MS Project 文件的输出与处理任务的某些遗留系统的输出进行比较。因为遗留系统的字段宽度非常有限,所以在添加值时,描述会被缩写。我想要一些半自动的方法来查找 MS Project 中的哪些条目与系统上的条目相似,以便我可以获得生成的密钥。它有缺点,因为它仍然必须手动检查,但它会节省大量工作
采纳答案by dfa
Yes, there are many well documented algorithms like:
是的,有许多有据可查的算法,例如:
- Cosine similarity
- Jaccard similarity
- Dice's coefficient
- Matching similarity
- Overlap similarity
- etc etc
- 余弦相似度
- 杰卡德相似度
- 骰子系数
- 匹配相似度
- 重叠相似度
- 等等等等
A good summary ("Sam's String Metrics") can be found here(original link dead, so it links to Internet Archive)
可以在这里找到一个很好的摘要(“Sam 的字符串度量”)(原始链接已失效,因此它链接到 Internet Archive)
Also check these projects:
还要检查这些项目:
回答by Florian Fankhauser
You could use Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance
您可以使用 Levenshtein 距离来计算两个字符串之间的差异。 http://en.wikipedia.org/wiki/Levenshtein_distance
回答by Anton Gogolev
Theoretically, you can compare edit distances.
理论上,您可以比较编辑距离。
回答by Laurence Gonsalves
This is typically done using an edit distancemeasure. Searching for "edit distance java" turns up a number of libraries, like this one.
回答by duffymo
Sounds like a plagiarism finderto me if your string turns into a document. Maybe searching with that term will turn up something good.
如果您的字符串变成文档,对我来说听起来像是抄袭发现者。也许用这个词搜索会发现一些好东西。
"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is in Python, but it's clean and easy to port.
“Programming Collective Intelligence”有一章是关于确定两个文档是否相似。代码是用 Python 编写的,但它干净且易于移植。
回答by user493744
I translated the Levenshtein distance algorithminto JavaScript:
我将Levenshtein 距离算法翻译成 JavaScript:
String.prototype.LevenshteinDistance = function (s2) {
var array = new Array(this.length + 1);
for (var i = 0; i < this.length + 1; i++)
array[i] = new Array(s2.length + 1);
for (var i = 0; i < this.length + 1; i++)
array[i][0] = i;
for (var j = 0; j < s2.length + 1; j++)
array[0][j] = j;
for (var i = 1; i < this.length + 1; i++) {
for (var j = 1; j < s2.length + 1; j++) {
if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
else {
array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
}
}
}
return array[this.length][s2.length];
};
回答by acdcjunior
The common way of calculating the similarity between two strings in a 0%-100% fashion, as used in many libraries, is to measure how much (in %) you'd have to change the longer string to turn it into the shorter:
以 0%-100% 的方式计算两个字符串之间相似度的常用方法,如许多库中所使用的,是测量您必须将较长的字符串更改为较短的字符串的程度(以 % 为单位):
/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below
Computing the editDistance()
:
计算editDistance()
:
The editDistance()
function above is expected to calculate the edit distancebetween the two strings. There are several implementationsto this step, each may suit a specific scenario better. The most common is the Levenshtein distance algorithmand we'll use it in our example below (for very large strings, other algorithms are likely to perform better).
editDistance()
上面的函数预计会计算两个字符串之间的编辑距离。此步骤有多种实现方式,每种实现方式可能更适合特定场景。最常见的是Levenshtein 距离算法,我们将在下面的示例中使用它(对于非常大的字符串,其他算法可能会表现得更好)。
Here's two options to calculate the edit distance:
以下是计算编辑距离的两个选项:
- You can use Apache Commons Text's implementation of Levenshtein distance:
apply(CharSequence left, CharSequence rightt)
- Implement it in your own. Below you'll find an example implementation.
- 您可以使用Apache Commons Text的 Levenshtein distance 实现:
apply(CharSequence left, CharSequence rightt)
- 自己实现。您将在下面找到一个示例实现。
Working example:
工作示例:
public class StringSimilarity {
/**
* Calculates the similarity (a number within 0 and 1) between two strings.
*/
public static double similarity(String s1, String s2) {
String longer = s1, shorter = s2;
if (s1.length() < s2.length()) { // longer should always have greater length
longer = s2; shorter = s1;
}
int longerLength = longer.length();
if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
/* // If you have Apache Commons Text, you can use it to calculate the edit distance:
LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// Example implementation of the Levenshtein Edit Distance
// See http://rosettacode.org/wiki/Levenshtein_distance#Java
public static int editDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();
int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0)
costs[j] = j;
else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1))
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0)
costs[s2.length()] = lastValue;
}
return costs[s2.length()];
}
public static void printSimilarity(String s, String t) {
System.out.println(String.format(
"%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
}
public static void main(String[] args) {
printSimilarity("", "");
printSimilarity("1234567890", "1");
printSimilarity("1234567890", "123");
printSimilarity("1234567890", "1234567");
printSimilarity("1234567890", "1234567890");
printSimilarity("1234567890", "1234567980");
printSimilarity("47/2010", "472010");
printSimilarity("47/2010", "472011");
printSimilarity("47/2010", "AB.CDEF");
printSimilarity("47/2010", "4B.CDEFG");
printSimilarity("47/2010", "AB.CDEFG");
printSimilarity("The quick fox jumped", "The fox jumped");
printSimilarity("The quick fox jumped", "The fox");
printSimilarity("kitten", "sitting");
}
}
Output:
输出:
1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
回答by Mohsen Abasi
Thank to the first answerer, I think there are 2 calculations of computeEditDistance(s1, s2). Due to high time spending of it, decided to improve the code's performance. So:
感谢第一个回答者,我认为computeEditDistance(s1, s2)有2次计算。由于花费了大量时间,决定提高代码的性能。所以:
public class LevenshteinDistance {
public static int computeEditDistance(String s1, String s2) {
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();
int[] costs = new int[s2.length() + 1];
for (int i = 0; i <= s1.length(); i++) {
int lastValue = i;
for (int j = 0; j <= s2.length(); j++) {
if (i == 0) {
costs[j] = j;
} else {
if (j > 0) {
int newValue = costs[j - 1];
if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
newValue = Math.min(Math.min(newValue, lastValue),
costs[j]) + 1;
}
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
if (i > 0) {
costs[s2.length()] = lastValue;
}
}
return costs[s2.length()];
}
public static void printDistance(String s1, String s2) {
double similarityOfStrings = 0.0;
int editDistance = 0;
if (s1.length() < s2.length()) { // s1 should always be bigger
String swap = s1;
s1 = s2;
s2 = swap;
}
int bigLen = s1.length();
editDistance = computeEditDistance(s1, s2);
if (bigLen == 0) {
similarityOfStrings = 1.0; /* both strings are zero length */
} else {
similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
}
//////////////////////////
//System.out.println(s1 + "-->" + s2 + ": " +
// editDistance + " (" + similarityOfStrings + ")");
System.out.println(editDistance + " (" + similarityOfStrings + ")");
}
public static void main(String[] args) {
printDistance("", "");
printDistance("1234567890", "1");
printDistance("1234567890", "12");
printDistance("1234567890", "123");
printDistance("1234567890", "1234");
printDistance("1234567890", "12345");
printDistance("1234567890", "123456");
printDistance("1234567890", "1234567");
printDistance("1234567890", "12345678");
printDistance("1234567890", "123456789");
printDistance("1234567890", "1234567890");
printDistance("1234567890", "1234567980");
printDistance("47/2010", "472010");
printDistance("47/2010", "472011");
printDistance("47/2010", "AB.CDEF");
printDistance("47/2010", "4B.CDEFG");
printDistance("47/2010", "AB.CDEFG");
printDistance("The quick fox jumped", "The fox jumped");
printDistance("The quick fox jumped", "The fox");
printDistance("The quick fox jumped",
"The quick fox jumped off the balcany");
printDistance("kitten", "sitting");
printDistance("rosettacode", "raisethysword");
printDistance(new StringBuilder("rosettacode").reverse().toString(),
new StringBuilder("raisethysword").reverse().toString());
for (int i = 1; i < args.length; i += 2) {
printDistance(args[i - 1], args[i]);
}
}
}
回答by Thibault Debatty
There are indeed a lot of string similarity measures out there:
确实有很多字符串相似性度量:
- Levenshtein edit distance;
- Damerau-Levenshtein distance;
- Jaro-Winkler similarity;
- Longest Common Subsequence edit distance;
- Q-Gram (Ukkonen);
- n-Gram distance (Kondrak);
- Jaccard index;
- Sorensen-Dice coefficient;
- Cosine similarity;
- ...
- Levenshtein 编辑距离;
- Damerau-Levenshtein 距离;
- Jaro-Winkler 相似度;
- 最长公共子序列编辑距离;
- Q-Gram(乌科宁);
- n-克距离(康德拉克);
- 杰卡德指数;
- Sorensen-Dice 系数;
- 余弦相似度;
- ...
You can find explanation and java implementation of these here: https://github.com/tdebatty/java-string-similarity
你可以在这里找到这些的解释和 java 实现:https: //github.com/tdebatty/java-string-similarity
回答by noelicus
You can achieve this using the apache commons java library. Take a look at these two functions within it:
- getLevenshteinDistance
- getFuzzyDistance
您可以使用apache commons java 库来实现这一点。看看里面的这两个函数:
- getLevenshteinDistance
- getFuzzyDistance