Java: fast algorithm for searching for substrings in a string

Question

Asked by Joel

I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.

What I would like to do is:

Given an input string - INSTR:

"BCDEFGH"

And a set of candidate strings - CAND:

"AB", "CDE", "FG", "H", "IJ"

Find any CAND strings that match as substrings within INSTR.

In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")

There could be many thousands of candidate strings (in CAND), but more importantly I will be doing this search many millions of times, so I need it to be FAST.

I'd like to work with char arrays. Also, I'm not interested in architectural solutions, like distributing the search, just the most efficient function/algorithm for doing it locally.

Additionally, all the strings in CAND and INSTR will be relatively small (< 50 chars), i.e. the target string INSTR is NOT long relative to the candidate strings.

Update: I should have mentioned that the set of CAND strings is invariant across all values of INSTR.

Update: I only need to know that there was a match; I don't need to know what the match was.

Final update: I opted to try Aho-Corasick and Rabin-Karp, due to simplicity of implementation. Because I have variable-length patterns, I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the shortest pattern; n was then also the length of my rolling substring search window. For Aho-Corasick I used this

In my test I searched for 1000 patterns in two documents (newspaper articles), averaged across 1000 iterations. Normalised times to complete were:

Aho-Corasick: 1

Rabin-Karp: 1.8

Naive search (check each pattern and use string.contains): 50
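The modified Rabin-Karp described in the final update might look roughly like the sketch below. This is not the asker's actual code: the class name PrefixRabinKarp, the method containsAny, and the hash multiplier are assumptions made for illustration. It hashes the first n characters of every pattern (n being the shortest pattern length), rolls a window of length n over the input, and verifies the full pattern only when a hash matches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a multi-pattern Rabin-Karp that hashes only a fixed-length prefix of each pattern. */
final class PrefixRabinKarp {
    private static final int BASE = 257;   // arbitrary multiplier (assumption); int overflow acts as the modulus
    private final int n;                   // length of the shortest pattern = rolling window length
    private final int highPower;           // BASE^(n-1), used to remove the window's leading char
    private final Map<Integer, List<char[]>> byPrefixHash = new HashMap<>();

    /** Assumes a non-empty list of non-empty patterns. */
    PrefixRabinKarp(List<String> patterns) {
        int min = Integer.MAX_VALUE;
        for (String p : patterns) min = Math.min(min, p.length());
        n = min;
        int pow = 1;
        for (int i = 1; i < n; i++) pow *= BASE;
        highPower = pow;
        // Group patterns by the hash of their first n characters.
        for (String p : patterns) {
            char[] chars = p.toCharArray();
            byPrefixHash.computeIfAbsent(hash(chars, 0), k -> new ArrayList<>()).add(chars);
        }
    }

    /** Hash of the n characters of s starting at index from. */
    private int hash(char[] s, int from) {
        int h = 0;
        for (int i = from; i < from + n; i++) h = h * BASE + s[i];
        return h;
    }

    /** Returns true as soon as any pattern is found in text. */
    boolean containsAny(char[] text) {
        if (text.length < n) return false;
        int h = hash(text, 0);
        for (int start = 0; ; start++) {
            List<char[]> bucket = byPrefixHash.get(h);
            if (bucket != null) {
                // Hash hit: confirm with a character-by-character comparison of the whole pattern.
                for (char[] p : bucket) if (matchesAt(text, start, p)) return true;
            }
            if (start + n >= text.length) return false;   // no further window fits
            // Roll the window: drop text[start], append text[start + n].
            h = (h - text[start] * highPower) * BASE + text[start + n];
        }
    }

    private static boolean matchesAt(char[] text, int start, char[] p) {
        if (start + p.length > text.length) return false;
        for (int i = 0; i < p.length; i++) if (text[start + i] != p[i]) return false;
        return true;
    }
}
```

Because CAND is fixed, the constructor runs once and containsAny is then called for each INSTR. Note that if the shortest pattern is a single character, the window degenerates to one char and most of the work moves into the verification step.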

Some resources describing the algorithms mentioned in the answers below:

http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html

http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/index.html

Answer 1

Accepted answer by Daniel Brückner

Read up on the Aho-Corasick algorithm and the Rabin-Karp algorithm.

If the input is not too large, you do not need to repeat the search many times, and you do not have many patterns, it might be a good idea to use a single-pattern algorithm several times. The Wikipedia article on search algorithms lists many algorithms along with their running and preprocessing times.

Implementations:

Presentations:

http://www.slideshare.net/taka111/ahocorasick-string-matching-algorithm-15078438

Answer 2

Answer by Antti Huima

Convert the set of candidate strings into a deterministic finite state automaton (DFA) and then run through the input string in linear time. Converting a single string into a DFA is well covered in the standard books. You can convert a set of strings by first constructing a non-deterministic automaton and then determinizing it. In the worst case that can cause an exponential blow-up in the size of the automaton, but the search afterwards is fast; this works especially well if the target string is long and the candidates are short.

Answer 3

Answer by Avi

You might want to look into the Aho-Corasick algorithm and related algorithms. I don't know of any libraries that implement this, offhand, but this is the classic way of solving this problem.
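For readers who want to see the shape of Aho-Corasick, here is a minimal sketch under a few assumptions: the class and method names (AhoCorasickMatcher, containsAny) are made up for this example, patterns are assumed non-empty, and transitions are stored in hash maps for brevity rather than speed. Since the question only asks whether any candidate matches, each state just carries a boolean instead of a list of matched patterns.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Sketch of Aho-Corasick that only reports whether any pattern occurs. */
final class AhoCorasickMatcher {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;          // longest proper suffix of this state that is also a trie prefix
        boolean terminal;   // a pattern ends here, or at a state reachable via failure links
    }

    private final Node root = new Node();

    /** Assumes every pattern is non-empty. Build once, reuse for every input. */
    AhoCorasickMatcher(List<String> patterns) {
        // 1. Insert all patterns into a trie.
        for (String p : patterns) {
            Node node = root;
            for (char c : p.toCharArray()) node = node.next.computeIfAbsent(c, k -> new Node());
            node.terminal = true;
        }
        // 2. Compute failure links breadth-first, so shallower states are finished first.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit "a pattern ends here" from the suffix state.
                child.terminal |= child.fail.terminal;
                queue.add(child);
            }
        }
    }

    /** Scans text once and returns true at the first position where any pattern ends. */
    boolean containsAny(char[] text) {
        Node node = root;
        for (char c : text) {
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node target = node.next.get(c);
            node = (target != null) ? target : root;
            if (node.terminal) return true;
        }
        return false;
    }
}
```

With the question's data, an instance built from "AB", "CDE", "FG", "H", "IJ" reports a match for "BCDEFGH" as soon as the scan reaches the end of "CDE". A tuned version would likely replace the maps with arrays indexed by character.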

Answer 4

Answer by emptyset

Rabin-Karp multiple pattern search appears to be the fastest.

Answer 5

Answer by spoulson

Also check the Boyer-Moore algorithm for single-string pattern matching.

Answer 6

Answer by Jørgen Fogh

This is what regular expressions are for. As noted above, finite state automata are what you need, but that is exactly how a standard regexp-matcher is implemented.

In Java you could write something like:

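// Join all candidate substrings into one alternation: cand1|cand2|...
StringBuilder sb = new StringBuilder();
boolean first = true;
for (String subStr : substrings) {
    if (first)
        first = false;
    else
        sb.append('|');
    // escape is a helper you supply; java.util.regex.Pattern.quote does this job
    sb.append(escape(subStr));
}
Pattern p = Pattern.compile(sb.toString());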

The method escape should escape any characters which have special meanings in a regexp; the standard Pattern.quote method does this.

Answer 7

Answer by Nick Dandoulakis

Another solution is to use a suffix array for the INSTR.
Since the INSTR is small you can sort it with bubble sort.

Afterwards you can search for a specific CAND string with binary search in O(log N) comparisons,
where N = length(suffix_array) = length(INSTR).
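The suffix-array idea can be sketched as follows. The names SuffixArraySearch and contains are invented for this example, and it uses Arrays.sort rather than the bubble sort suggested above, purely for brevity. A candidate is a substring of INSTR exactly when it is a prefix of some suffix, so the binary search compares the candidate against each probed suffix truncated to the candidate's length.

```java
import java.util.Arrays;

/** Sketch: sorted suffix array over a short text, queried with binary search. */
final class SuffixArraySearch {
    private final String text;
    private final Integer[] suffixStarts;   // start offsets, ordered by the suffix beginning there

    SuffixArraySearch(String text) {
        this.text = text;
        suffixStarts = new Integer[text.length()];
        for (int i = 0; i < suffixStarts.length; i++) suffixStarts[i] = i;
        // Simple comparison sort on the suffixes; adequate for inputs of < 50 chars.
        Arrays.sort(suffixStarts, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    /** Returns true if candidate occurs somewhere in the text. */
    boolean contains(String candidate) {
        int lo = 0;
        int hi = suffixStarts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = comparePrefix(suffixStarts[mid], candidate);
            if (cmp == 0) return true;       // candidate is a prefix of this suffix
            if (cmp < 0) lo = mid + 1;       // suffix sorts before candidate
            else hi = mid - 1;               // suffix sorts after candidate
        }
        return false;
    }

    /** Compares the suffix at start, truncated to the candidate's length, with the candidate. */
    private int comparePrefix(int start, String candidate) {
        int end = Math.min(text.length(), start + candidate.length());
        return text.substring(start, end).compareTo(candidate);
    }
}
```

Each comparison costs up to the candidate's length, so a lookup is O(M log N) character work for a candidate of length M. The array has to be rebuilt for every new INSTR, and every CAND string still needs its own lookup.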

Answer 8

Answer by Joy Dutta

We can take advantage of the small size (< 50 char) of the strings to build a super fast algo for this case, at the cost of memory.

We can store all possible substrings of INSTR in a hash table once, which costs O(n^2) time. Then, regardless of the number of CAND strings, each lookup will be O(1). This is worth it for a very large number of CAND strings.
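A minimal sketch of that idea, with hypothetical names (SubstringIndex, allSubstrings, containsAny) chosen for this example: every substring of INSTR goes into a HashSet, and each CAND string is then a single set lookup.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: index every substring of a short text, then test candidates by hash lookup. */
final class SubstringIndex {

    /** Collects all non-empty substrings of text; there are at most n(n+1)/2 of them. */
    static Set<String> allSubstrings(String text) {
        Set<String> subs = new HashSet<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                subs.add(text.substring(start, end));
            }
        }
        return subs;
    }

    /** Returns true if any candidate is a substring of text. */
    static boolean containsAny(String text, List<String> candidates) {
        Set<String> subs = allSubstrings(text);
        for (String c : candidates) {
            if (subs.contains(c)) return true;
        }
        return false;
    }
}
```

For a 50-character INSTR this stores at most 1275 strings. The O(n^2) figure counts substrings; copying and hashing each one adds a further factor proportional to its length.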

If INSTR is large, then we can build a suffix array and not sort it, so that the top item is the longest (=N) and bottom item is the last char of INSTR. Now for each CAND string, only search from the top as long as length(CAND) <= length(suffix). Each of those comparisons will be O(n).

Answer 9

Answer by Mike

Here are some implementations of fast string search algorithms in Java.

Answer 10

Answer by Deepak Kumar

import java.util.Scanner;

public class StringMatch {

    // Naive search: returns true if pattern occurs anywhere in text.
    static boolean contains(char[] text, char[] pattern) {
        // Try every start position that leaves room for the whole pattern.
        for (int start = 0; start + pattern.length <= text.length; start++) {
            int j = 0;
            // Advance while the characters agree.
            while (j < pattern.length && text[start + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return true;   // every character of the pattern matched
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        char[] text = sc.nextLine().toCharArray();      // e.g. "today is Monday"
        char[] pattern = sc.nextLine().toCharArray();   // e.g. "is"
        System.out.println(contains(text, pattern));    // prints true or false
    }
}
