Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must apply the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/848231/


Regex optimisation - escaping ampersands in java

java regex optimization

Asked by Duveit

I need to replace every & in a String that isn't part of an HTML entity, so that the String "This & entities &gt; & &lt;" becomes "This &amp; entities &gt; &amp; &lt;".

I've come up with this regex pattern: "&[a-zA-Z0-9]{2,7};", which works fine. But I'm not very skilled with regex, and when I test the speed over 100k iterations, it takes twice as long as the method I used previously, which didn't use regex (but didn't work 100% either).

Test code:

long time = System.currentTimeMillis();
String reg = "&(?!&#?[a-zA-Z0-9]{2,7};)";
String s = "a regex test 1 & 2  1&2 and &_gt; - &_lt;";
String test = null;
for (int i = 0; i < 100000; i++) {
    // Replace every & matched by the pattern with &amp;
    test = s.replaceAll(reg, "&amp;");
}
System.out.println("Finished in:" + (System.currentTimeMillis() - time) + " milliseconds");

So the question is: are there any obvious ways to optimize this regular expression to make it more efficient?

Answered by Chris Thornhill

s.replaceAll(reg, "&amp;") compiles the regular expression on every call. Compiling the pattern once will provide some increase in performance (~30% in this case).

// Requires: import java.util.regex.Pattern;
long time = System.currentTimeMillis();
String reg = "&(?!&#?[a-zA-Z0-9]{2,7};)";
// Compile once, outside the loop, and reuse the Pattern for every replacement
Pattern p = Pattern.compile(reg);
String s = "a regex test 1 & 2  1&2 and &_gt; - &_lt;";
for (int i = 0; i < 100000; i++) {
    String test = p.matcher(s).replaceAll("&amp;");
}
System.out.println("Finished in:" + (System.currentTimeMillis() - time) + " milliseconds");
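
One way to keep the compile-once idea out of the calling code is to hold the pattern in a static constant. The following is only a minimal sketch: the class name AmpersandEscaper and the method escape are hypothetical, and the pattern used is the look-ahead without the extra & (an assumption made here so that existing entities are left alone).

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not taken from the question.
final class AmpersandEscaper {
    // Compiled once when the class is loaded. Pattern objects are immutable
    // and safe to share between threads; each call creates its own Matcher.
    // Assumed pattern: an & that is NOT followed by something entity-like.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!#?[a-zA-Z0-9]{2,7};)");

    private AmpersandEscaper() {}

    static String escape(String input) {
        return BARE_AMPERSAND.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("1 & 2 and &gt; - &lt;"));
    }
}
```

Callers then only see escape(...), and the compilation cost is paid once per class load rather than once per call.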

Answered by Gumbo

You have to exclude the & from your look-ahead assertion. So try this regular expression:

&(?!#?[a-zA-Z0-9]{2,7};)
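
To see why the extra & matters, the two patterns can be run side by side. This is a small sketch with a hypothetical class name (LookaheadComparison) and a made-up input string. With the extra &, the look-ahead only succeeds when a second & follows, so ordinary entities such as &gt; are escaped as well.

```java
// Hypothetical comparison harness; names and sample input are illustrative.
final class LookaheadComparison {
    // Pattern from the question: the look-ahead starts with a second &.
    static final String WITH_EXTRA_AMPERSAND = "&(?!&#?[a-zA-Z0-9]{2,7};)";
    // Pattern from this answer: the look-ahead starts right after the matched &.
    static final String WITHOUT_EXTRA_AMPERSAND = "&(?!#?[a-zA-Z0-9]{2,7};)";

    static String escape(String regex, String input) {
        return input.replaceAll(regex, "&amp;");
    }

    public static void main(String[] args) {
        String s = "1 & 2 and &gt;";
        System.out.println(escape(WITH_EXTRA_AMPERSAND, s));    // 1 &amp; 2 and &amp;gt;
        System.out.println(escape(WITHOUT_EXTRA_AMPERSAND, s)); // 1 &amp; 2 and &gt;
    }
}
```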

Or to be more precise:

&(?!(?:#(?:[xX][0-9a-fA-F]+|[0-9]+)|[a-zA-Z]+);)
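
As a rough illustration of what the stricter form accepts, the sketch below applies it to named, decimal and hexadecimal references. The class name PreciseEntityCheck and the sample strings are hypothetical, and the hex branch is written with a repeated digit class ([0-9a-fA-F]+), an assumption made here so that multi-digit references such as &#x26; are recognized.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and inputs are illustrative only.
final class PreciseEntityCheck {
    // An & is left alone only when followed by one of:
    //   #x / #X plus hex digits, # plus decimal digits, or letters,
    // and then a terminating semicolon.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:#(?:[xX][0-9a-fA-F]+|[0-9]+)|[a-zA-Z]+);)");

    static String escape(String input) {
        return BARE_AMPERSAND.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // Well-formed references are kept; the trailing bare & is escaped.
        System.out.println(escape("&lt; &#38; &#x26; &"));
        // Malformed or incomplete references are escaped.
        System.out.println(escape("AT&T &#xZZ; &;"));
    }
}
```

Note that the letters-only branch treats names containing digits as non-entities, so whether this form is preferable depends on which entities the input is expected to contain.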

Answered by Valentin Rocher

Another way of doing this without breaking your head over regular expressions would be to use StringEscapeUtils from Commons Lang.

Answered by John Weldon

I'm not very familiar with the Java regex classes, but in general you may want to investigate a zero-width lookahead for ; after the ampersand.

Here is a link describing positive and negative lookaheads
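
For readers new to lookaheads, here is a minimal sketch of how the two kinds behave in java.util.regex. The class name LookaheadDemo, the helper matchStarts and the sample string are hypothetical. Both patterns match only the & itself, because a lookahead checks the following text without consuming it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo; names and input are illustrative only.
final class LookaheadDemo {
    // Returns the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String s = "a & b &gt; c";
        // Positive lookahead (?=...): & followed by letters and a semicolon.
        System.out.println(matchStarts("&(?=[a-zA-Z]+;)", s)); // [6]
        // Negative lookahead (?!...): & NOT followed by letters and a semicolon.
        System.out.println(matchStarts("&(?![a-zA-Z]+;)", s)); // [2]
    }
}
```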
