Java: partially matching strings with List.contains(String)

Notice: this page reproduces a popular StackOverFlow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute the content to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/6645379/

Time: 2020-08-16 08:35:58  Source: igfitidea

java regex

Asked by y2p

I have a List<String>:

List<String> list = new ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");

If I do list.contains("EFGH"), it returns true. Can I get true for list.contains("IJ")? I mean, can I partially match strings to find out whether they exist in the list?

I have a list of 15000 strings, and I have to check whether each of about 10000 strings exists in the list. What could be some other (faster) way to do this?

Thanks.

Answer by Hovercraft Full Of Eels

Perhaps you want to put each String fragment into a HashSet. By fragment, I mean: don't add "IJ KL", but rather add "IJ" and "KL" separately. If you need both the list and this search capability, you may need to maintain two collections.
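
A minimal sketch of the fragment idea, assuming fragments are separated by whitespace (the question does not say how they are delimited). The class and method names (FragmentIndex, build) are made up for illustration. Note that this only finds whole fragments such as "IJ", not arbitrary substrings such as "NO".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FragmentIndex {
    // Splits every list entry on whitespace (an assumption) and collects the pieces in a HashSet.
    static Set<String> build(List<String> list) {
        Set<String> fragments = new HashSet<>();
        for (String entry : list) {
            for (String fragment : entry.trim().split("\\s+")) {
                if (!fragment.isEmpty()) {
                    fragments.add(fragment);
                }
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("ABCD", "EFGH", "IJ KL", "M NOP", "UVW X");
        Set<String> fragments = build(list);
        System.out.println(fragments.contains("IJ")); // true: "IJ" is a whole fragment of "IJ KL"
        System.out.println(fragments.contains("NO")); // false: "NO" is only part of the fragment "NOP"
    }
}
```

Building the set costs one pass over the list; after that each lookup is an average constant-time hash lookup, which is what makes this attractive for about 10000 queries. The original list is kept as the second collection.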

Answer by Roadrunner-EX

You can iterate over the list, and then call contains() on each String.

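// Requires: java.util.Iterator, java.util.List
// Returns true if any String in the list contains checkStr as a substring.
public boolean listContainsString(List<String> list, String checkStr)
{
    Iterator<String> iter = list.iterator();
    while(iter.hasNext())
    {
        String s = iter.next();
        if (s.contains(checkStr))
        {
            return true;
        }
    }
    return false;
}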

Something like that should work, I think.

Answer by Roadrunner-EX

As a second answer, upon rereading your question: you could also extend a List implementation such as ArrayList, specialize it for Strings only, and override the contains() method.

// Requires: java.util.ArrayList, java.util.Iterator
public class PartialStringList extends ArrayList<String>
{
    // Returns true if any element contains the given String as a substring.
    @Override
    public boolean contains(Object o)
    {
        if(!(o instanceof String))
        {
            return false;
        }
        String s = (String)o;
        Iterator<String> iter = iterator();
        while(iter.hasNext())
        {
            String iStr = iter.next();
            if (iStr.contains(s))
            {
                return true;
            }
        }
        return false;
    }
}

Judging by your earlier comments, this is maybe not the speed you're looking for, but is this more similar to what you were asking for?

Answer by Eng.Fouad

How about:

java.util.List<String> list = new java.util.ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");
java.util.regex.Pattern p = java.util.regex.Pattern.compile("IJ");
java.util.regex.Matcher m = p.matcher("");
for(String s : list)
{
    m.reset(s);
    if(m.find()) System.out.println("Partially Matched");
}

Answer by Kowser

If the suggestion from Roadrunner-EX does not suffice, then I believe you are looking for the Knuth–Morris–Pratt algorithm.

Time complexity:

  • Building the partial-match table (preprocessing) is O(k)
  • The search itself is O(n)

So, the complexity of the overall algorithm is O(n + k).

  • n = length of the text being searched (for a list, the total number of characters in its strings)
  • k = length of the pattern you are searching for

A naive brute-force search has a worst-case time complexity of O(nk).

Moreover, the KMP table depends only on the search string, so when the same string is searched for in many texts the O(k) preprocessing is done once and reused. The brute-force approach, on the other hand, gains nothing from repeating the same search string.
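
A minimal sketch of how KMP could be applied to the list from the question, assuming plain char-by-char comparison. The names (KmpSearch, buildTable, listContains) are made up for illustration. The table is built once per search string and reused for every list entry.

```java
import java.util.Arrays;
import java.util.List;

final class KmpSearch {
    // Builds the partial-match (failure) table for the pattern.
    static int[] buildTable(String pattern) {
        int[] table = new int[pattern.length()];
        int len = 0; // length of the longest proper prefix that is also a suffix
        for (int i = 1; i < pattern.length(); i++) {
            while (len > 0 && pattern.charAt(i) != pattern.charAt(len)) {
                len = table[len - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(len)) {
                len++;
            }
            table[i] = len;
        }
        return table;
    }

    // Returns true if the pattern occurs somewhere in the text.
    static boolean contains(String text, String pattern, int[] table) {
        if (pattern.isEmpty()) {
            return true;
        }
        int matched = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (matched > 0 && text.charAt(i) != pattern.charAt(matched)) {
                matched = table[matched - 1]; // fall back without re-reading the text
            }
            if (text.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == pattern.length()) {
                return true;
            }
        }
        return false;
    }

    // Builds the table once and reuses it for every string in the list.
    static boolean listContains(List<String> list, String pattern) {
        int[] table = buildTable(pattern);
        for (String text : list) {
            if (contains(text, pattern, table)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("ABCD", "EFGH", "IJ KL", "M NOP", "UVW X");
        System.out.println(listContains(list, "IJ"));  // true
        System.out.println(listContains(list, "XYZ")); // false
    }
}
```

For short strings like the ones in the question, String.contains() may well be just as fast in practice; measuring on real data is the only way to tell.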

Answer by Bohemian

Here's some code that uses a regex to shortcut the inner loop if none of the test Strings are found in the target String.

// Requires: java.util.Arrays, java.util.List, java.util.regex.Pattern
public static void main(String[] args) throws Exception {
    List<String> haystack = Arrays.asList("ABCD", "EFGH", "IJ KL", "M NOP", "UVW X");
    List<String> needles = Arrays.asList("IJ", "NOP");

    // To cut down on iterations, create one big regex to check the whole haystack
    StringBuilder sb = new StringBuilder();
    sb.append(".*(");
    for (String needle : needles) {
        // Quote each needle so regex metacharacters in it are matched literally
        sb.append(Pattern.quote(needle)).append('|');
    }
    sb.replace(sb.length() - 1, sb.length(), ").*");
    // Compile once; String.matches() would recompile the regex on every call
    Pattern pattern = Pattern.compile(sb.toString());

    for (String target : haystack) {
        if (!pattern.matcher(target).matches()) {
            System.out.println("Skipping " + target);
            continue;
        }

        // Only targets that passed the combined regex reach the inner loop
        for (String needle : needles) {
            if (target.contains(needle)) {
                System.out.println(target + " contains " + needle);
            }
        }
    }
}

Output:

Skipping ABCD
Skipping EFGH
IJ KL contains IJ
M NOP contains NOP
Skipping UVW X

If you really want to get cute, you could bisect the target list with a binary search to identify which segments match, but it might not be worth it.

It depends on how likely it is that you'll find a hit. Low hit rates will give a good result; high hit rates will perform not much better than the simple nested loop version. Consider inverting the loops if some needles hit many targets and others hit none.
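
One way the inverted loops might look, as a sketch: the outer loop runs over the needles and the inner loop stops at the first target that contains the needle. The names (InvertedLoops, findPresent) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InvertedLoops {
    // Returns the needles that occur as a substring of at least one haystack entry.
    static List<String> findPresent(List<String> haystack, List<String> needles) {
        List<String> present = new ArrayList<>();
        for (String needle : needles) {
            for (String target : haystack) {
                if (target.contains(needle)) {
                    present.add(needle);
                    break; // the first hit settles this needle, so skip the remaining targets
                }
            }
        }
        return present;
    }

    public static void main(String[] args) {
        List<String> haystack = Arrays.asList("ABCD", "EFGH", "IJ KL", "M NOP", "UVW X");
        List<String> needles = Arrays.asList("IJ", "NOP", "XYZ");
        System.out.println(findPresent(haystack, needles)); // [IJ, NOP]
    }
}
```

A needle that hits many targets leaves the inner loop early, while a needle that hits none still scans the whole haystack, so the gain depends on the hit rate.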

It's all about aborting a search path ASAP.

Answer by Powerslave

Yes, you can! Sort of.

What you are looking for is often called fuzzy searching or approximate string matching, and there are several solutions to this problem.

With the FuzzyWuzzy lib, for example, you can have all your strings assigned a score based on how similar they are to a particular search term. The actual values seem to be integer percentages of the number of matching characters relative to the search string length.

After invoking FuzzySearch.extractAll, it is up to you to decide the minimum score at which a string counts as a match.
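
To illustrate the idea of a score threshold without depending on any library, here is a self-contained sketch that uses a hand-rolled Levenshtein distance as a stand-in. This is not FuzzyWuzzy's scoring. The 0 to 100 score formula and all names (ThresholdMatch, score, matches) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ThresholdMatch {
    // Classic dynamic-programming Levenshtein edit distance, keeping two rows at a time.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical similarity score: 100 for identical strings, lower as more edits are needed.
    static int score(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100;
        }
        return 100 - (100 * distance(a, b)) / longest;
    }

    // Keeps only the candidates whose score against the term reaches minScore.
    static List<String> matches(List<String> candidates, String term, int minScore) {
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (score(candidate, term) >= minScore) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("ABCD", "EFGH", "IJ KL", "M NOP", "UVW X");
        System.out.println(matches(list, "ABCE", 70)); // [ABCD]
    }
}
```

Whole-string edit distance penalizes length differences heavily, so a short term like "IJ" scores low against "IJ KL" here; a real fuzzy-matching library typically offers partial-match scoring for that case.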

There are also other, similar libraries worth checking out, like google-diff-match-patch or the Apache Commons Text Similarity API, and so on.

If you need something really heavy-duty, your best bet would probably be Lucene (as also mentioned by Ryan Shillington).

Answer by brunobastosg

You could use IterableUtils from Apache Commons Collections.

List<String> list = new ArrayList<String>();
list.add("ABCD");
list.add("EFGH");
list.add("IJ KL");
list.add("M NOP");
list.add("UVW X");

boolean hasString = IterableUtils.contains(list, "IJ", new Equator<String>() {
    @Override
    public boolean equate(String o1, String o2) {
        return o2.contains(o1);
    }

    @Override
    public int hash(String o) {
        return o.hashCode();
    }
});

System.out.println(hasString); // true