Java: regex vs. contains. Best performance?
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow
Original: http://stackoverflow.com/questions/2023792/
regex VS Contains. Best Performance?
Asked by Mike
I want to compare a URI string against different patterns in Java, and I want the fastest code possible.
Should I use:
if (uri.contains("/br/fab") || uri.contains("/br/err") || uri.contains("/br/sts"))
Or something like:
if (uri.matches(".*/br/(fab|err|sts).*"))
Note that I may have many more URIs, and this method is called very often.
Which of these choices is best?
Accepted answer by Ewan Todd
They're both fast enough to be over before you know it. I'd go for the one that you can read more easily.
Answer by Brian Agnew
I would expect contains() to be faster, since it doesn't have to compile and iterate through a (relatively) complex regular expression, but simply looks for a sequence of characters.
But (as with all optimisations) you should measure this. Your particular situation may impact the results to a greater or lesser degree.
Furthermore, is this known to be causing you grief (with respect to performance)? If not, I wouldn't worry about it too much, and would choose the most appropriate solution for your requirements regardless of performance issues. Premature optimisation will cause you an inordinate amount of grief if you let it!
Answer by Jon Skeet
If you're going to use a regular expression, create it up-front and reuse the same Pattern object:
private static final Pattern pattern = Pattern.compile(".*/br/(fab|err|sts).*");
Do you actually need the ".*" at each end? I wouldn't expect it to be required if you use Matcher.find().
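As a minimal sketch of the suggestion above, the pattern can be compiled once without the surrounding ".*" and searched with Matcher.find(). The class name UriPatternCheck and the helper matchesBrPath are hypothetical names chosen for this illustration, not part of the original answer.

```java
import java.util.regex.Pattern;

class UriPatternCheck {
    // Compiled once and reused; Pattern instances are immutable and safe to share.
    // No leading/trailing ".*" is needed because find() searches anywhere in the input.
    private static final Pattern BR_PATTERN = Pattern.compile("/br/(fab|err|sts)");

    // Hypothetical helper name, used only for this illustration.
    static boolean matchesBrPath(String uri) {
        return BR_PATTERN.matcher(uri).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesBrPath("http://example.com/br/fab")); // true
        System.out.println(matchesBrPath("http://example.com/us/fab")); // false
    }
}
```

By contrast, String.matches(regex) is documented as equivalent to Pattern.matches(regex, input), which compiles the expression on every call, so reusing a Pattern should avoid that repeated cost.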
Which is faster? The easiest way to find out is to measure it against some sample data - with as realistic samples as possible. (The fastest solution may very well depend on
Are you already sure this is a bottleneck, though? If you've already measured the code enough to find out that it's a bottleneck, I'm surprised you haven't just tried both already. If you haven't verified that it's a problem, that's the first thing to do before worrying about the "fastest code possible".
If it's not a bottleneck, I would personally opt for the non-regex version unless you're a regex junkie. Regular expressions are very powerful, but also very easy to get wrong.
Answer by ZoFreX
If the bit you are trying to match against is always at the beginning, or the end, or is in some other way predictable, then: neither!
For example, if URLs are always like http://example.com/br/fab or http://example.com/br/err, then you could store "br/fab", "br/err", etc. in a HashSet or similar, and then, given an incoming URL, chop off the last part of it and query the Set to see if it contains it. This will scale better than either method you gave (with a HashSet, lookups should get no slower no matter how many entries there are).
If you do need to match against substrings appearing in arbitrary locations... it depends what you mean by "a lot more". One thing you should do regardless of the specifics of the problem is try things out and benchmark them!
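The HashSet idea above could be sketched roughly as follows, assuming the interesting part is always the last two path segments of the URL. The class name UriSuffixLookup and its helper names are hypothetical, and the set contents are the three suffixes from the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class UriSuffixLookup {
    // Known suffixes, taken from the patterns in the question.
    private static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("br/fab", "br/err", "br/sts"));

    // Returns everything after the second-to-last '/', the whole string if it
    // contains only one '/', or null if it contains no '/' at all.
    static String lastTwoSegments(String uri) {
        int last = uri.lastIndexOf('/');
        if (last < 0) {
            return null;
        }
        int prev = uri.lastIndexOf('/', last - 1);
        return uri.substring(prev + 1);
    }

    // One hash lookup, regardless of how many suffixes are stored.
    static boolean isKnown(String uri) {
        String key = lastTwoSegments(uri);
        return key != null && KNOWN.contains(key);
    }

    public static void main(String[] args) {
        System.out.println(isKnown("http://example.com/br/fab")); // true
        System.out.println(isKnown("http://example.com/br/xyz")); // false
    }
}
```

This only works when the position of the interesting part is predictable; for substrings at arbitrary positions it would not apply.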
Answer by Mike
I've done a test, and it is faster to use contains. As Ewan Todd said, they are both fast enough that it isn't really worth bothering about.
Answer by Panthro
UPDATE: I know this is not the best benchmark code, and for each case there are several ways to optimize it.
What I wanted to capture is the "common" way a regular developer, who does things the simple way and is not a JVM expert, would use these methods, so here it goes.
ORIGINAL:
The code below produced the following output (times in milliseconds, as measured with System.currentTimeMillis()):
contains took: 70
matches took: 113
matches with pre pattern took: 419
The test class:
import java.util.regex.Pattern;

public class MatchesTester {
    public static void main(String[] args) {
        String typeStr = "Nunc rhoncus odio ac tellus pulvinar, et volutpat sapien aliquet. Nam sed libero nec ex laoreet pretium sed id mi. Aliquam erat volutpat. Aenean at erat vitae massa iaculis mattis. Quisque sagittis massa orci, sit amet vestibulum turpis tempor a. Etiam eget venenatis arcu. Nunc enim augue, pulvinar at nulla ut, pellentesque porta sapien. Maecenas ut erat id nisi tincidunt faucibus eget vel erat. Morbi quis magna et massa pharetra venenatis ut a lacus. Vivamus egestas vitae nulla eget tristique. Praesent consectetur, tellus quis bibendum suscipit, nisl turpis mattis sapien, ultrices mollis leo quam eu eros.application/binaryNunc rhoncus odio ac tellus pulvinar, et volutpat sapien aliquet. Nam sed libero nec ex laoreet pretium sed id mi. Aliquam erat volutpat. Aenean at erat vitae massa iaculis mattis. Quisque sagittis massa orci, sit amet vestibulum turpis tempor a. Etiam eget venenatis arcu. Nunc enim augue, pulvinar at nulla ut, pellentesque porta sapien. Maecenas ut erat id nisi tincidunt faucibus eget vel erat. Morbi quis magna et massa pharetra venenatis ut a lacus. Vivamus egestas vitae nulla eget tristique. Praesent consectetur, tellus quis bibendum suscipit, nisl turpis mattis sapien, ultrices mollis leo quam eu eros.";
        int timesToTest = 10000;

        // Test 1: chained String.contains() calls
        long start = System.currentTimeMillis();
        int count = 0;
        while (count < timesToTest) {
            if (typeStr.contains("image") || typeStr.contains("audio") || typeStr.contains("video") || typeStr.contains("application")) {
                // do something inexpensive, like creating a simple primitive variable
                int a = 10;
            }
            count++;
        }
        long end = System.currentTimeMillis();
        System.out.println("contains took: " + (end - start));

        // Test 2: String.matches(), which recompiles the regex on every call.
        // Note: matches() must match the whole string, so this condition is
        // never true for typeStr; it is not equivalent to the contains() test.
        long start2 = System.currentTimeMillis();
        count = 0;
        while (count < timesToTest) {
            if (typeStr.matches("(image|audio|video|application)")) {
                // do something inexpensive, like creating a simple primitive variable
                int a = 10;
            }
            count++;
        }
        long end2 = System.currentTimeMillis(); // new variable, to have the same cost as the contains test
        System.out.println("matches took: " + (end2 - start2));

        // Test 3: precompiled Pattern, searched with Matcher.find()
        long start3 = System.currentTimeMillis();
        count = 0;
        Pattern pattern = Pattern.compile("(image|audio|video|application)");
        while (count < timesToTest) {
            if (pattern.matcher(typeStr).find()) {
                // do something inexpensive, like creating a simple primitive variable
                int a = 10;
            }
            count++;
        }
        long end3 = System.currentTimeMillis(); // new variable, to have the same cost as the contains test
        System.out.println("matches with pre pattern took: " + (end3 - start3));
    }
}
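One detail worth keeping in mind when reading the numbers above: String.matches() only succeeds when the whole input matches the pattern, whereas Matcher.find() looks for a match anywhere. The following small sketch (with a shortened, made-up input string and a hypothetical class name MatchesVsFind) illustrates the difference, so the three tests in the benchmark are not doing strictly equivalent work.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // Same alternation as in the test class above.
    static final String REGEX = "(image|audio|video|application)";

    public static void main(String[] args) {
        // Shortened stand-in for the long test string (an assumption for brevity).
        String input = "Lorem ipsum application/binary dolor";

        // matches() requires the entire string to match the pattern.
        System.out.println(input.matches(REGEX));                         // false
        // Wrapping the pattern in ".*" makes it behave like a substring search.
        System.out.println(input.matches(".*" + REGEX + ".*"));           // true
        // find() searches for the pattern anywhere in the input.
        System.out.println(Pattern.compile(REGEX).matcher(input).find()); // true
    }
}
```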
Answer by Nit
It's much faster if you use indexOf().
if (uri.indexOf("/br/fab") > -1 || uri.indexOf("/br/err") > -1 || uri.indexOf("/br/sts") > -1)
{
    // your code
}
The problem with contains() is that internally it creates a Matcher (java.util.regex.Matcher) object and evaluates the expression.
Matcher is very costly when processing large amounts of data.
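For what it's worth, in the OpenJDK sources String.contains(CharSequence) is implemented as a thin wrapper around indexOf(s.toString()), with no java.util.regex.Matcher involved, so the two forms would be expected to perform about the same on such JDKs. The sketch below only checks that both forms give the same answers; the class and method names are hypothetical, and it is not a benchmark.

```java
class ContainsVsIndexOf {
    // The contains() form from the question.
    static boolean viaContains(String uri) {
        return uri.contains("/br/fab") || uri.contains("/br/err") || uri.contains("/br/sts");
    }

    // The indexOf() form from the answer above.
    static boolean viaIndexOf(String uri) {
        return uri.indexOf("/br/fab") > -1 || uri.indexOf("/br/err") > -1 || uri.indexOf("/br/sts") > -1;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/br/fab",
            "http://example.com/br/sts?x=1",
            "http://example.com/us/fab"
        };
        for (String uri : samples) {
            // Both columns should always agree.
            System.out.println(uri + " -> " + viaContains(uri) + " / " + viaIndexOf(uri));
        }
    }
}
```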
Answer by aatmopaw
Both are fast enough, but contains is faster. Facts: ~20 million ops/s vs. ~1 million ops/s.
Tested with the following JMH code:
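import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Benchmark)
public class Main {
    // Sample input; note that it contains none of the searched substrings.
    private String uri = "https://google.com/asdfasdf/ptyojty/aeryethtr";

    // Note: the results below are discarded. JMH guidance is usually to return
    // the value (or consume it via a Blackhole) so the JIT cannot eliminate the call.
    @Benchmark
    @Warmup(iterations = 5)
    @Measurement(iterations = 5)
    @Fork(value = 1, warmups = 0)
    public void initContains() throws InterruptedException {
        if (uri.contains("/br/fab") || uri.contains("/br/err") || uri.contains("/br/sts")) {}
    }

    @Benchmark
    @Warmup(iterations = 5)
    @Measurement(iterations = 5)
    @Fork(value = 1, warmups = 0)
    public void initMatches() throws InterruptedException {
        if (uri.matches(".*/br/(fab|err|sts).*")) {}
    }

    // Delegates to the JMH command-line runner.
    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}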
The results:
# Run complete. Total time: 00:00:37
Benchmark           Mode  Cnt         Score         Error  Units
Main.initContains  thrpt    5  21004897.968 ± 1987176.746  ops/s
Main.initMatches   thrpt    5   1177562.581 ±  248488.092  ops/s