Java: regular expression runs too slowly

Question

Asked by carpediem

My goal is to match URLs of these kinds:
url.com
my.url.com
my.extended.url.com
a.super.extended.url.com
and so on...

So I decided to build the regex so that the URL starts and ends with a letter or a digit, with any number of "subdomains" in between, each made of alphanumeric characters and a dot. For example, in "my.extended.url.com", the "m" of "my" matches the first class of the regex, the "m" of "com" matches the last class, and "y.", "extended." and "url." match the second class.

Using the pattern and subject in the code below, I want the find method to return false, because this URL must not match. Instead it uses 100% of the CPU and seems to be stuck in an infinite loop.

    
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String subject = "www.association-belgo-palestinienne-be";
    // Backslashes must be doubled inside a Java string literal.
    Pattern pattern = Pattern.compile("^[A-Za-z0-9]\\.?([A-Za-z0-9_-]+\\.?)*[A-Za-z0-9]\\.[A-Za-z]{2,6}");

    Matcher m = pattern.matcher(subject);
    System.out.println("    Start");
    boolean hasFind = m.find();
    System.out.println("    Finish : " + hasFind);

Which only prints:

  
      Start

I can't reproduce the problem using regex testers.
Is this normal? Is the problem coming from my regex?
Could it be due to my Java version (1.6.0_22-b04 / JVM 64 bit 17.1-b03)?

Thanks in advance for helping.

Answer 1

Answered by Avi

The problem is the ([A-Za-z0-9_-]+\\.?)* part of the regular expression. Note that it has a quantifier (+) inside another quantifier (*). This causes catastrophic backtracking: the engine has to try an exponential number of ways to match before it can decide that the regular expression fails, at least the way most regular expression engines are implemented (including the Java one).
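
To get a feel for the scale, here is a rough counting model, not a trace of the engine itself. It assumes that on a run of n characters containing no dot, the inner + may stop after any character and the outer * may start a new repetition there, so the run can be cut into non-empty chunks in 2^(n-1) ways. The names SplitCount and countSplits are made up for this illustration.

```java
final class SplitCount {
    // Number of ways to cut a run of n characters into consecutive non-empty chunks.
    // This models ([A-Za-z0-9_-]+\.?)* on a dot-free run; it is an assumption-based
    // model, not a measurement of what java.util.regex does internally.
    static long countSplits(int n) {
        long[] ways = new long[n + 1];
        ways[0] = 1; // the empty run has exactly one split: no chunks at all
        for (int len = 1; len <= n; len++) {
            for (int first = 1; first <= len; first++) {
                ways[len] += ways[len - first]; // first chunk takes 'first' characters
            }
        }
        return ways[n];
    }

    public static void main(String[] args) {
        String subject = "www.association-belgo-palestinienne-be";
        // Everything after the last dot is one long dot-free run.
        int run = subject.length() - subject.lastIndexOf('.') - 1;
        System.out.println("dot-free run length: " + run);
        System.out.println("ways to split it:    " + countSplits(run));
    }
}
```

Under this model each extra character in the run doubles the count, which would explain why a short hostname finishes instantly while a long one appears to hang.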

If you use possessive quantifiers, you will be able to avoid this problem; however, that would change the meaning of your regex, and it would no longer match what you want it to match.
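
A minimal sketch of that trade-off. The answer does not spell out a possessive version, so the form ([A-Za-z0-9_-]++\\.?)*+ used below is an assumption, and the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

final class PossessiveDemo {
    // Original pattern from the question (greedy quantifiers).
    static final Pattern GREEDY = Pattern.compile(
        "^[A-Za-z0-9]\\.?([A-Za-z0-9_-]+\\.?)*[A-Za-z0-9]\\.[A-Za-z]{2,6}");

    // Assumed possessive variant: ++ and *+ never give characters back.
    static final Pattern POSSESSIVE = Pattern.compile(
        "^[A-Za-z0-9]\\.?([A-Za-z0-9_-]++\\.?)*+[A-Za-z0-9]\\.[A-Za-z]{2,6}");

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        // Returns promptly, because nothing is retried.
        System.out.println(found(POSSESSIVE, "www.association-belgo-palestinienne-be"));
        // But a valid host is now rejected as well ...
        System.out.println(found(POSSESSIVE, "my.url.com"));
        // ... whereas the greedy original accepts it.
        System.out.println(found(GREEDY, "my.url.com"));
    }
}
```

The possessive group swallows the whole host, including the final "com", and cannot hand anything back to the trailing [A-Za-z0-9]\\.[A-Za-z]{2,6}, so even valid hosts fail.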

I think the trick here is to find a regex which expresses what you want to solve, without double quantifiers. For example, the following should work:

    Pattern.compile("^[A-Za-z0-9]\\.?([A-Za-z0-9_-]|[A-Za-z0-9_-]\\.)*[A-Za-z0-9]\\.[A-Za-z]{2,6}$");

I think this expresses the same class of strings that you are trying to match, and should be much faster.
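
A quick way to check that claim is to run the rewritten pattern against the sample hosts from the question and against the problematic subject. This is only a sketch; AlternationCheck and accepts are hypothetical names.

```java
import java.util.regex.Pattern;

final class AlternationCheck {
    // The rewritten pattern, with backslashes doubled for a Java string literal.
    static final Pattern P = Pattern.compile(
        "^[A-Za-z0-9]\\.?([A-Za-z0-9_-]|[A-Za-z0-9_-]\\.)*[A-Za-z0-9]\\.[A-Za-z]{2,6}$");

    static boolean accepts(String host) {
        return P.matcher(host).find();
    }

    public static void main(String[] args) {
        String[] hosts = {
            "url.com",
            "my.url.com",
            "my.extended.url.com",
            "a.super.extended.url.com",
            "www.association-belgo-palestinienne-be" // the subject that hung
        };
        for (String host : hosts) {
            System.out.println(host + " -> " + accepts(host));
        }
    }
}
```

Each repetition of the group now consumes exactly one character (plus an optional dot), so there is only one way to walk through a dot-free run.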

Answer 2

Answered by moinudin

It's not an infinite loop. The problem is that it's checking every possible match and not finding one. If you could let it run for a gazillion years, it would eventually terminate. See this article for a good explanation of what's happening under the hood.

Perhaps this regular expression is satisfactory (it terminates on the given string): ^[A-Za-z0-9][A-Za-z0-9_-]*(\\.[A-Za-z0-9_-]+)*\\.[A-Za-z]{2,6}$ (see http://ideone.com/Z0rlg)
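
For readers who cannot open the ideone link, a small stand-alone sketch of the same check follows. It is not the code behind that link; LabelRegexCheck and isHost are names invented here.

```java
import java.util.regex.Pattern;

final class LabelRegexCheck {
    // Every repetition of the group must begin with a literal dot, so a label
    // can only be consumed by one iteration and the split is unambiguous.
    static final Pattern HOST = Pattern.compile(
        "^[A-Za-z0-9][A-Za-z0-9_-]*(\\.[A-Za-z0-9_-]+)*\\.[A-Za-z]{2,6}$");

    static boolean isHost(String s) {
        return HOST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHost("url.com"));
        System.out.println(isHost("a.super.extended.url.com"));
        // The subject from the question: rejected without a long wait.
        System.out.println(isHost("www.association-belgo-palestinienne-be"));
    }
}
```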

Answer 3

Answered by LaGrandMere

It isn't really an infinite loop, it's just taking a really long time. For all practical purposes, we can call it a hang.

Your regex can be improved.

Try putting $ at the end of it. It tells the engine that the match must end at the end of the line, which may save you some time.

Edit:

    String subject = "www-association-belgo-palestinienne-be";
    // Backslashes must be doubled inside a Java string literal.
    Pattern pattern = Pattern.compile("^[A-Za-z0-9]([-_A-Za-z0-9]*)(\\.[-_A-Za-z0-9]+)*\\.([-_A-Za-z0-9]+\\.)*([-_A-Za-z0-9]*)[A-Za-z0-9]$");

    Matcher m = pattern.matcher(subject);
    System.out.println("    Start");
    boolean hasFind = m.find();
    System.out.println("    Finish : " + hasFind);

Answer 4

Answered by Yuval F

See "How do you debug a regex?".

Specifically, I would try regexpal, and change the doubled Java backslashes to single ones.
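
One way to get the single-backslash form without editing by hand is to let Java print the pattern it compiled. A small sketch, where ShowRawPattern and rawPattern are hypothetical names; Pattern.pattern() is the standard accessor that returns the regex text the pattern was compiled from.

```java
import java.util.regex.Pattern;

final class ShowRawPattern {
    // Returns the regex as the engine sees it, after Java has processed
    // the string-literal escapes (so each \\ in the source becomes one \).
    static String rawPattern() {
        Pattern p = Pattern.compile(
            "^[A-Za-z0-9]\\.?([A-Za-z0-9_-]+\\.?)*[A-Za-z0-9]\\.[A-Za-z]{2,6}");
        return p.pattern();
    }

    public static void main(String[] args) {
        // Paste the printed line into the regex tester.
        System.out.println(rawPattern());
    }
}
```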

Answer 5

Answered by foo

It is an obvious bug in the Java regexp implementation. Look at the results for your regexp and input data here, and you will see how quickly it is evaluated.
