Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/16586656/

Time: 2020-10-31 23:22:56  Source: igfitidea

Extract all occurrences of pattern K and check if string matches "K*" in 1 pass

java regex

Asked by Bernhard Barker

For a given input string and a given pattern K, I want to extract every occurrence of K (or some part of it, using groups) from the string and check that the entire string matches K* (that is, it consists of 0 or more K's with no other characters).

But I would like to do this in a single pass using regular expressions. More specifically, I'm currently finding the pattern using Matcher.find, but this is not strictly required.

How would I do this?

I already found a solution (and posted an answer), but would like to know whether there is specific regex or Matcher functionality that addresses this issue, or simply whether there are better or different ways of doing it. Even if not, I still think it's an interesting question.

Example:

Pattern: <[0-9]> (a single digit in <>)

Valid input: <1><2><3>

Invalid inputs:

<1><2>a<3>
<1><2>3
Oh look, a flying monkey!
<1><2><3

Code to do it in 2 passes with matches:

boolean products(String products)
{
    String regex = "(<[0-9]>)";
    Pattern pAll = Pattern.compile(regex + "*");

    if (!pAll.matcher(products).matches())
        return false;

    Pattern p = Pattern.compile(regex);
    Matcher matcher = p.matcher(products);

    while (matcher.find())
        System.out.println(matcher.group());

    return true;
}

Accepted answer by nhahtdh

1. Defining the problem

Since it is not clear what to output when the whole string does not match pattern K*, I will redefine the problem to make clear what to output in that case.

Given any pattern K:

  • Check that the string has the pattern K*.
  • If the string has pattern K*, then split the string into non-overlapping tokens that match K.
  • If only a prefix of the string matches pattern K*, then pick the prefix that is chosen by K*+ [1], and split the prefix into tokens that match K.

[1] I don't know if there is any way to get the longest prefix that matches K. Of course, you can always remove the last character one at a time and test against K* until it matches, but that is obviously inefficient.

Unless specified otherwise, whatever I write below follows my problem description above. Note that the 3rd bullet point of the problem resolves the ambiguity about which prefix string to take.

2. Repeated capturing group in .NET

The problem above can be solved if we have a solution to the following problem:

Given a pattern (K)*, which is a repeated capturing group, get the captured text for all the repetitions, instead of only the last repetition.

  • In the case where the string has pattern K*, by matching against ^(K)*$, we can get all tokens that match pattern K.
  • In the case where only a prefix of the string matches K*, by matching against ^(K)*, we can get all tokens that match pattern K.

This is the case in .NET regex, since it keeps all the captured text for a repeated capturing group.

However, since we are using Java, we don't have access to such a feature.
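To illustrate the limitation, here is a minimal sketch (the class and method names are made up for this example) that matches the question's sample input against (<[0-9]>)* and reads group 1. In java.util.regex, a repeated capturing group exposes only the text of its last repetition.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Returns what group 1 holds after matching the whole input against (K)*,
    // or null when the input does not match (or the group never participated).
    static String lastCapture(String k, String input) {
        Matcher m = Pattern.compile("(" + k + ")*").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Only the last repetition survives; <1> and <2> are not retrievable.
        System.out.println(lastCapture("<[0-9]>", "<1><2><3>")); // prints <3>
    }
}
```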

3. Solution in Java

Checking that the string has the pattern K* can always be done with Matcher.matches() / String.matches(), since the engine will do full-blown backtracking on the input string to somehow "unify" K* with the input string. The hard part is splitting the input string into tokens that match pattern K.

If K* is equivalent to K*+

If the pattern K has the property:

For all strings [2], K* is equivalent to K*+, i.e. the input string is split into tokens that match pattern K in the same way.

[2] You can define this condition for only the input strings you are operating on, but ensuring this pre-condition is not easy. When you define it for all strings, you only need to analyze your regex to check whether the condition holds or not.
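As a quick way to see what the property means, the sketch below compares the greedy and the possessive form of K* on a sample string. The pattern a|ab is my own illustrative choice of a K for which the two forms disagree, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class PossessiveCheck {
    // Does the whole input match K* (ordinary greedy quantifier, may backtrack)?
    static boolean greedy(String k, String input) {
        return Pattern.matches("(?:" + k + ")*", input);
    }

    // Does the whole input match K*+ (possessive quantifier, no backtracking)?
    static boolean possessive(String k, String input) {
        return Pattern.matches("(?:" + k + ")*+", input);
    }

    public static void main(String[] args) {
        // For the question's pattern both forms agree on this input.
        System.out.println(greedy("<[0-9]>", "<1><2>"));     // prints true
        System.out.println(possessive("<[0-9]>", "<1><2>")); // prints true
        // For a|ab the possessive form commits to "a" and cannot recover.
        System.out.println(greedy("a|ab", "ab"));            // prints true
        System.out.println(possessive("a|ab", "ab"));        // prints false
    }
}
```

A test like this can only show that the property fails for a particular input. Showing that it holds for all strings still requires analyzing the regex itself.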

Then a one-pass solution that solves the problem can be constructed. You can repeatedly use Matcher.find() on the pattern \GK, and check that the last match found ends right at the end of the string. This is similar to your current solution, except that you do the boundary check with code.

The + after the quantifier * in K*+ makes the quantifier possessive. A possessive quantifier prevents the engine from backtracking, which means each repetition is always the first possible match for the pattern K. We need this property so that the solution \GK has an equivalent meaning, since it also returns the first possible match for the pattern K.
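The one-pass approach described above could look roughly like the following sketch. It assumes K never matches the empty string and that K* is equivalent to K*+. The class name, the method name, and the choice to return null for invalid input are all my own, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredTokenizer {
    // Returns the tokens when the whole input matches K*, otherwise null.
    static List<String> tokenize(String k, String input) {
        // \G anchors every match at the end of the previous one,
        // so find() cannot skip over characters that do not belong to a K.
        Matcher m = Pattern.compile("\\G(?:" + k + ")").matcher(input);
        List<String> tokens = new ArrayList<>();
        int lastEnd = 0;
        while (m.find()) {
            tokens.add(m.group());
            lastEnd = m.end();
        }
        // Boundary check done in code: the last match must end at the end of the input.
        return lastEnd == input.length() ? tokens : null;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<[0-9]>", "<1><2><3>"));  // prints [<1>, <2>, <3>]
        System.out.println(tokenize("<[0-9]>", "<1><2>a<3>")); // prints null
    }
}
```

Collecting the tokens in a list before returning means nothing is emitted for invalid input, at the cost of buffering them.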

If K* is NOT equivalent to K*+

Without the property above, we need 2 passes to solve the problem. The first pass calls Matcher.matches() / String.matches() on the pattern K*. On the second pass:

  • If the string does not match pattern K*, we repeatedly use Matcher.find() on the pattern \GK until no more matches can be found. This works because of how we defined which prefix string to take when the input string does not match pattern K*.

  • If the string matches pattern K*, repeatedly using Matcher.find() on the pattern \GK(?=K*$) is one solution. This results in redundant work matching the rest of the input string, though.

Note that this solution is universally applicable to any K. In other words, it also applies to the case where K* is equivalent to K*+ (but we will use the better one-pass solution for that case instead).
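A minimal sketch of the two-pass variant for a matching string follows. It covers only the branch where the input matches K* and returns null otherwise. The names are hypothetical, and the sample pattern a|ab is my own choice of a K where a plain \GK would take "a" first and then get stuck on "b".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadTokenizer {
    // Returns the tokens when the whole input matches K*, otherwise null.
    static List<String> tokenize(String k, String input) {
        String unit = "(?:" + k + ")";
        // First pass: full backtracking check against K*.
        if (!Pattern.compile(unit + "*").matcher(input).matches()) {
            return null;
        }
        // Second pass: \GK(?=K*$) accepts a token only if the rest of the
        // input can still be consumed by K*, which forces the right split.
        Matcher m = Pattern.compile("\\G" + unit + "(?=" + unit + "*$)").matcher(input);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a|ab", "aba")); // prints [ab, a]
        System.out.println(tokenize("a|ab", "abb")); // prints null
    }
}
```

The lookahead rescans the remainder of the input for every token, which is the redundant work mentioned above.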

Answer by Scott

Here is an additional answer to the already accepted one. Below is an example code snippet that only goes through the pattern once with m.find(), which is similar to your one-pass solution, but will not parse non-matching lines.

import java.util.regex.*;

class test{
    public static void main(String args[]){
        String t = "<1><2><3>";
        Pattern pat = Pattern.compile("(<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*)");
        Matcher m = pat.matcher(t);
        while (m.find()) {
            System.out.println("Matches!");
            System.out.println(m.group());
        }       

    }
}

The regex explained:

<\\d> -- This is your K pattern as defined above
?= -- positive lookahead (check what is ahead of K)
(<\\d>)* -- match K 0 or more times
$ -- end of line
?<= -- positive lookbehind (check what is behind K)
^ -- beginning of line
(<\\d>)* -- followed by 0 or more Ks

Regular expressions are beautiful things.

Edit: As pointed out to me by @nhahtdh, this is just an implemented version of the accepted answer. In fact, the implementation above can be improved with the knowledge in that answer.
(<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*) can be changed to \\G<\\d>(?=(<\\d>)*$).
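For reference, here is a small sketch of how the shortened regex behaves, using hypothetical class and method names. Because every token must be followed by zero or more further tokens up to the end of the input, nothing at all is extracted from an invalid string. An empty result is therefore ambiguous between an empty input and an invalid one, so the caller has to tell those apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GuardedExtractor {
    // \G pins each match to the end of the previous one; the lookahead
    // requires the rest of the input to consist solely of further tokens.
    private static final Pattern TOKEN = Pattern.compile("\\G<\\d>(?=(<\\d>)*$)");

    // Returns every token of a valid input, or an empty list otherwise.
    static List<String> extract(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(extract("<1><2><3>"));  // prints [<1>, <2>, <3>]
        System.out.println(extract("<1><2>a<3>")); // prints []
    }
}
```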

Answer by Bernhard Barker

Below is a one-pass solution using Matcher.start and Matcher.end.

boolean products(String products)
{
    String regex = "<[0-9]>";

    Pattern p = Pattern.compile(regex);

    Matcher matcher = p.matcher(products);
    int lastEnd = 0;
    while (matcher.find())
    {
        if (lastEnd != matcher.start())
           return false;
        System.out.println(matcher.group());
        lastEnd = matcher.end();
    }
    if (lastEnd != products.length())
        return false;
    return true;
}

The only disadvantage is that it will print out (or process) all values prior to finding invalid data.

For example, products("<1><2>a<3>"); will print out:

<1>
<2>

prior to returning false (because up to that point the string is valid).

Either having this happen or having to store all of them temporarily seems to be unavoidable.

Answer by Joop Eggen

    String t = "<1><2><3>";
    // The backslash must be doubled inside a Java string literal.
    Pattern pat = Pattern.compile("(<\\d>)*");
    Matcher m = pat.matcher(t);
    if (m.matches()) {
        //String[] tt = t.split("(?<=>)"); // Look behind on '>'
        String[] tt = t.split("(?<=(<\\d>))"); // Look behind on K
    }