Notice: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/327513/

Date: 2020-08-11 13:13:36  Source: igfitidea

Fuzzy string search library in Java

java nlp fuzzy-search

Asked by dario

I'm looking for a high-performance Java library for fuzzy string search.

There are numerous algorithms for finding similar strings: Levenshtein distance, Daitch-Mokotoff Soundex, n-grams, etc.

What Java implementations exist, and what are their pros and cons? I'm aware of Lucene. Is there any other solution, or is Lucene the best?

I found these; does anyone have experience with them?

Answered by Vugluskr

Apache Lucene is the only way, I think. I don't know of any better search library.

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Answered by JodaStephen

Commons Lang has an implementation of Levenshtein distance.
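For readers who want to see what such a distance computes, here is a minimal plain-Java sketch of Levenshtein distance using the usual two-row dynamic programming table. It is an illustration only, not the Commons Lang source; the class name `LevenshteinSketch` and the method name `distance` are made up for this example.

```java
final class LevenshteinSketch {

    /**
     * Number of single-character insertions, deletions and substitutions
     * needed to turn a into b (hypothetical helper, not a library API).
     */
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));  // prints 3
        System.out.println(distance("William", "Willaim")); // prints 2
    }
}
```

Memory use is proportional to the length of the second string, which matters little for short names but helps when comparing longer texts.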

Commons Codec has implementations of Soundex and Metaphone.
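To give an idea of what a phonetic code looks like, the following is a small sketch of classic American Soundex in plain Java. It is not the Commons Codec implementation and it is not the Daitch-Mokotoff variant mentioned in the question; the class name `SoundexSketch` and the method name `encode` are assumptions made for this example, and only the letters A-Z are considered.

```java
import java.util.Locale;

final class SoundexSketch {

    // Code for each letter A..Z: digits for consonants, '0' for vowels and Y,
    // '-' for H and W (which are skipped and do not separate equal codes).
    private static final String CODES = "0123012-02245501262301-202";

    /** Returns a four-character code such as "R163", or "" if no letters are present. */
    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char last = 0; // code of the previously seen letter (ignoring H and W)

        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not a plain letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
            } else if (code != '-') {
                if (code != '0' && code != last) {
                    out.append(code); // new consonant group
                }
                last = code; // a vowel resets the group, so a repeat after it counts
            }
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert"));   // prints R163
        System.out.println(encode("Rupert"));   // prints R163
        System.out.println(encode("Ashcraft")); // prints A261
    }
}
```

Names that sound alike map to the same code, so matching becomes an exact comparison of codes rather than a distance computation.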

Answered by Darren

SimMetrics is probably what you need: http://sourceforge.net/projects/simmetrics/

It has several algorithms for calculating various flavours of edit-distance.

Lucene is a very powerful full-text search engine, but full-text search isn't exactly the same thing as fuzzy string matching (e.g., given a list of strings, find the one that is most similar to some candidate string).
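As a rough illustration of that kind of matching, the sketch below picks the entry of a list that is most similar to a query string. The similarity measure is a Dice coefficient over character bigrams, chosen here only because it is short to write; it is not the SimMetrics API, and all names (`SimilaritySearchSketch`, `bigrams`, `dice`, `closest`) are hypothetical.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

final class SimilaritySearchSketch {

    /** Set of lower-cased two-character substrings of s. */
    static Set<String> bigrams(String s) {
        String t = s.toLowerCase(Locale.ROOT);
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 1 < t.length(); i++) {
            out.add(t.substring(i, i + 2));
        }
        return out;
    }

    /** Dice coefficient in [0, 1]: twice the shared bigrams over the total bigram count. */
    static double dice(String a, String b) {
        Set<String> x = bigrams(a);
        Set<String> y = bigrams(b);
        if (x.isEmpty() && y.isEmpty()) {
            // Strings shorter than two characters have no bigrams; fall back to equality.
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return 2.0 * common.size() / (x.size() + y.size());
    }

    /** Returns the candidate with the highest similarity to the query, if any. */
    static Optional<String> closest(String query, List<String> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble(c -> dice(query, c)));
    }

    public static void main(String[] args) {
        List<String> names = List.of("Walter", "William", "Wilma");
        System.out.println(closest("Willaim", names).orElse("(no candidates)"));
    }
}
```

This scans every candidate, which is fine for small in-memory lists; for large collections an index-based approach is usually needed.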

Answered by Mojo Risin

You can try Bitap. I was playing with a Bitap implementation written in ANSI C and it was pretty fast. There is a Java implementation at http://www.crosswire.org.

Answered by Mond Raymond

Answered by Henno Vermeulen

You can use Apache Lucene, but depending on the use case it may be too heavyweight. For very simple fuzzy searches it may be a bit complex to use, and (correct me if I'm wrong) it requires you to build an index.

If you need a simple online algorithm (that is, one that does not maintain an index), you can use the fuzzy Bitap algorithm. I found an implementation in Java here. Its code fits in a single, relatively short method with an almost self-explanatory signature:

public static List<Integer> find(String doc, String pattern, int k)

Apache Commons StringUtils has an implementation of the Levenshtein algorithm for fuzzy String matching. It can be seen as the fuzzy version of String.equals. Bitap is like the fuzzy version of String.indexOf and still uses the Levenshtein distance measure. It is generally more efficient than naively using Levenshtein to compare the search pattern with each substring that could possibly match.
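To make the idea concrete, here is a minimal sketch of a fuzzy Bitap search (the Wu-Manber bit-parallel formulation) with the same signature as above. It is not the implementation linked in this answer. Several choices are assumptions of this sketch: the returned integers are the indices in doc where an approximate match ends, the pattern is limited to 63 characters so that the state fits in a long, and character masks are kept in a map rather than a fixed-size array so that non-ASCII characters do not cause an out-of-bounds access.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BitapSketch {

    /** End indices in doc of substrings within Levenshtein distance k of pattern. */
    static List<Integer> find(String doc, String pattern, int k) {
        int m = pattern.length();
        if (m == 0 || m > 63 || k < 0 || k >= m) {
            throw new IllegalArgumentException("need 1 <= pattern length <= 63 and 0 <= k < pattern length");
        }

        // masks.get(c) has bit i set when pattern.charAt(i) == c.
        Map<Character, Long> masks = new HashMap<>();
        for (int i = 0; i < m; i++) {
            masks.merge(pattern.charAt(i), 1L << i, (x, y) -> x | y);
        }

        // r[d] has bit i set when pattern[0..i] matches a suffix of the text
        // read so far with at most d errors. With d errors available, the
        // first d pattern characters can be "matched" by deleting them.
        long[] r = new long[k + 1];
        for (int d = 0; d <= k; d++) {
            r[d] = (1L << d) - 1;
        }

        long accept = 1L << (m - 1);
        List<Integer> ends = new ArrayList<>();

        for (int i = 0; i < doc.length(); i++) {
            long mask = masks.getOrDefault(doc.charAt(i), 0L);
            long prevLower = r[0]; // value of r[d - 1] before reading this character
            r[0] = ((r[0] << 1) | 1L) & mask;
            for (int d = 1; d <= k; d++) {
                long old = r[d];
                r[d] = (((old << 1) | 1L) & mask) // exact match of the next character
                        | prevLower               // extra character in the text
                        | (prevLower << 1) | 1L   // substituted character
                        | (r[d - 1] << 1);        // character missing from the text
                prevLower = old;
            }
            if ((r[k] & accept) != 0) {
                ends.add(i);
            }
        }
        return ends;
    }

    public static void main(String[] args) {
        System.out.println(find("abcdef", "cde", 0)); // prints [4]
        System.out.println(find("the quick brwn fox", "brown", 1));
    }
}
```

Because neighbouring positions often match with different error placements, a caller will usually want to merge adjacent end indices into a single hit.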

Notes:

  • The Bitap algorithm seems to be mostly useful for relatively small alphabets, e.g. plain ASCII. In fact, the Simon Watiau version I linked to throws an ArrayIndexOutOfBoundsException on non-ASCII characters (>= 128), so you will have to filter these out.
  • I tried using Bitap in an application to search an in-memory list of persons by name. I found that a Levenshtein distance of 2 gives way too many false positives. A Levenshtein distance of 1 works better, but it cannot detect a typo where you swap two letters, e.g. "William" and "Willaim". I can think of a few ways to solve this, e.g.

    1. do a fuzzy search only when an exact search finds no matches (and show a message to the user about this)
    2. adjust Bitap to use Damerau-Levenshtein distance, where a swap has distance 1 instead of 2. According to Wikipedia, this is possible, but I could not find an existing implementation in Java.
    3. instead of "contains" do a "startsWith". The fuzzy search tools contains a prefix version of Damerau-Levenshtein, but it gave me an ArrayIndexOutOfBoundsException
    4. adjust the algorithm to introduce search result ranking where exact matches score higher

    If you are going to do 2 or 4, it may be better to use a proper full-text search library like Lucene anyway.

  • More information on fuzzy search can be found on this blog. Its author also created an implementation in Java called BitapOnlineSearcher, but it requires you to use java.io.Reader together with an Alphabet class. Its Javadoc is written in Russian.
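Regarding the swapped-letter case in the notes above, the sketch below shows how a distance that counts an adjacent swap as one edit behaves. It is the restricted Damerau-Levenshtein variant (often called optimal string alignment) written as a plain dynamic programming table; it compares two whole strings and is not a Bitap adaptation. The class name `OsaDistanceSketch` and the method name `distance` are made up for this example.

```java
final class OsaDistanceSketch {

    /**
     * Edit distance where insertion, deletion, substitution and a swap of two
     * adjacent characters each cost 1 (optimal string alignment variant).
     */
    static int distance(CharSequence a, CharSequence b) {
        int n = a.length();
        int m = b.length();
        int[][] d = new int[n + 1][m + 1];

        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // insert all j characters
        }

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
                        d[i - 1][j - 1] + cost);                    // match or substitution
                // Adjacent transposition: the last two characters are swapped.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("William", "Willaim")); // prints 1
        System.out.println(distance("kitten", "sitting"));  // prints 3
    }
}
```

With this measure a threshold of 1 accepts "Willaim" for "William", whereas plain Levenshtein distance would need a threshold of 2.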

Answered by ?????????s

If you are mostly comparing short strings and want something portable and lightweight, you can use the well-known Python algorithm fuzzywuzzy, which has been ported to Java.

You can read more about it here

Answered by Filipe Miguel Fonseca

You can try the Completely library. It relies on text preprocessing to create an in-memory index for efficiently answering (fuzzy) searches in large data sets. Unlike Lucene and other full-featured text search libraries, its API is small and easy to get started with.
