How to compare almost similar Strings in Java? (String distance measure)

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/2084730/

Tags: java, string, comparison, levenshtein-distance, string-metric

Asked by hsmit

I would like to compare two strings and get a score for how much they look alike. For example, "The sentence is almost similar" and "The sentence is similar".

I'm not familiar with existing methods in Java, but for PHP I know the levenshtein function.

Are there better methods in Java?

Accepted answer by Joey

The Levenshtein distance is a measure of how similar two strings are. Or, more precisely, of how many alterations have to be made to make them the same.

The algorithm is available in pseudo-code on Wikipedia. Converting that to Java shouldn't be much of a problem, but it is not built into the base class library.
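As a rough illustration of what such a conversion might look like, here is a minimal sketch of the usual dynamic-programming formulation that keeps only two rows of the table. The class and method names (LevenshteinSketch, distance, similarity) are made up for this example, and the normalized similarity is just one common way to turn the distance into a score between 0 and 1, not something the answer prescribes.

```java
// A minimal sketch, not taken from any library; all names here are hypothetical.
// Note: it compares UTF-16 code units via charAt, so characters outside the
// Basic Multilingual Plane count as two units.
final class LevenshteinSketch {

    /** Number of single-character insertions, deletions and substitutions needed to turn a into b. */
    static int distance(String a, String b) {
        // previous[j] = distance between the first i-1 chars of a and the first j chars of b
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitutionCost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + substitutionCost);
            }
            // The row just computed becomes the "previous" row of the next iteration.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Assumed normalization: 1.0 for identical strings, 0.0 when every position differs. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        String a = "The sentence is almost similar";
        String b = "The sentence is similar";
        System.out.println(distance(a, b));   // prints 7 (the seven characters of "almost " are deleted)
        System.out.println(similarity(a, b)); // 1 - 7/30, roughly 0.77
    }
}
```

The two-row layout keeps memory proportional to the length of the second string instead of the full table, which is usually enough when only the distance (and not the edit script) is needed.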

Wikipedia has some more algorithms that measure the similarity of strings.

Answer by jspcal

Yeah, that's a good metric. You could use StringUtils.getLevenshteinDistance() from Apache Commons Lang.

Answer by FiveO

The following Java libraries offer multiple comparison algorithms (Levenshtein, Jaro-Winkler, ...):

  1. Apache Commons Lang 3: https://commons.apache.org/proper/commons-lang/
  2. Simmetrics: http://sourceforge.net/projects/simmetrics/

Both libraries have Javadoc documentation (Apache Commons Lang Javadoc, Simmetrics Javadoc).

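// Usage of Apache Commons Lang 3
// Note: getJaroWinklerDistance is deprecated as of Commons Lang 3.6; Apache Commons Text is the suggested replacement.
import org.apache.commons.lang3.StringUtils;

public class CommonsLangExample {
    public double compareStrings(String stringA, String stringB) {
        return StringUtils.getJaroWinklerDistance(stringA, stringB);
    }
}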

// Usage of Simmetrics
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;

public class SimmetricsExample {
    public double compareStrings(String stringA, String stringB) {
        JaroWinkler algorithm = new JaroWinkler();
        return algorithm.getSimilarity(stringA, stringB);
    }
}
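For readers curious about what a Jaro-Winkler score actually computes, the following is a minimal, dependency-free sketch of the textbook definition. It is not the source of either library above, so small differences from their results are possible; the names (JaroWinklerSketch, jaro, jaroWinkler) are hypothetical, and the prefix scaling factor of 0.1 with a maximum prefix of 4 is the commonly cited default, assumed here.

```java
// A minimal sketch of the textbook Jaro-Winkler similarity; all names are hypothetical.
final class JaroWinklerSketch {

    /** Jaro similarity in [0, 1]; 1.0 means identical. */
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        if (a.isEmpty() || b.isEmpty()) return 0.0;

        // Two equal characters only count as a match if they are at most `window` positions apart.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] matchedA = new boolean[a.length()];
        boolean[] matchedB = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(b.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matchedB[j] && a.charAt(i) == b.charAt(j)) {
                    matchedA[i] = true;
                    matchedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Walk the matched characters of both strings in order and count the ones that disagree.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!matchedA[i]) continue;
            while (!matchedB[j]) j++;
            if (a.charAt(i) != b.charAt(j)) outOfOrder++;
            j++;
        }
        int transpositions = outOfOrder / 2;
        double m = matches;
        return (m / a.length() + m / b.length() + (m - transpositions) / m) / 3.0;
    }

    /** Jaro similarity boosted for strings that share a common prefix (up to 4 characters). */
    static double jaroWinkler(String a, String b) {
        double base = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return base + prefix * 0.1 * (1.0 - base); // 0.1 is the assumed default scaling factor
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA")); // roughly 0.961
    }
}
```

Unlike Levenshtein, this is a similarity (higher means more alike) rather than a distance, which is worth keeping in mind when swapping one metric for the other.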

Answer by Thibault Debatty

You can find implementations of Levenshtein and other string similarity/distance measures on https://github.com/tdebatty/java-string-similarity

If your project uses Maven, installation is as simple as:

<dependency>
  <groupId>info.debatty</groupId>
  <artifactId>java-string-similarity</artifactId>
  <version>RELEASE</version>
</dependency>

Then, to use Levenshtein, for example:

import info.debatty.java.stringsimilarity.*;

public class MyApp {

  public static void main (String[] args) {
    Levenshtein l = new Levenshtein();

    // The two strings differ by a single substitution ('s' -> '$').
    System.out.println(l.distance("My string", "My $tring"));
  }
}

Answer by Vaibhav Kumar

Shameless plug, but I wrote a library also:

https://github.com/vickumar1981/stringdistance

It has all of these functions, plus a few for phonetic similarity (whether one word "sounds like" another word; these return either true or false, unlike the other fuzzy similarities, which are numbers between 0 and 1).

It also includes DNA sequence alignment algorithms such as Smith-Waterman and Needleman-Wunsch, which are generalized versions of Levenshtein.
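One way to read the "generalized versions of Levenshtein" remark: Needleman-Wunsch fills the same kind of table, but maximizes an alignment score with configurable match, mismatch and gap values. The sketch below is a plain-Java illustration of that idea, not code from the stringdistance library; the class name, method name and parameter names are assumptions made for this example. With match = 0, mismatch = -1 and gap = -1, the score is the negated Levenshtein distance.

```java
// A minimal sketch of Needleman-Wunsch global alignment scoring; all names are hypothetical.
final class NeedlemanWunschSketch {

    /** Best global alignment score of a and b for the given (assumed) scoring parameters. */
    static int score(String a, String b, int match, int mismatch, int gap) {
        int[][] table = new int[a.length() + 1][b.length() + 1];
        // Aligning a prefix against the empty string costs one gap per character.
        for (int i = 1; i <= a.length(); i++) table[i][0] = i * gap;
        for (int j = 1; j <= b.length(); j++) table[0][j] = j * gap;

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int pair = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diagonal = table[i - 1][j - 1] + pair; // align the two characters
                int up = table[i - 1][j] + gap;            // gap in b
                int left = table[i][j - 1] + gap;          // gap in a
                table[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        return table[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // With these parameters the score is the negated Levenshtein distance.
        System.out.println(score("kitten", "sitting", 0, -1, -1)); // prints -3
    }
}
```

Smith-Waterman follows the same pattern for local alignment, with scores clamped at zero and the best cell taken from anywhere in the table.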

I plan, in the near future, on making this work with any array and not just strings (an array of characters).
