java正则表达式:性能和替代

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/19829892/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-12 20:32:55  来源:igfitidea点击:

java regular expressions: performance and alternative

javaregex

提问by Joseph_Marzbani

Recently I have been had to search a number of string values to see which one matches a certain pattern. Neither the number of string values nor the pattern itself is clear until a search term has been entered by the user. The problem is I have noticed each time my application runs the following line:

最近,我不得不搜索许多字符串值以查看哪个与某个模式匹配。在用户输入搜索词之前,字符串值的数量和模式本身都不清楚。问题是每次我的应用程序运行以下行时我都注意到:

    if (stringValue.matches (rexExPattern))
    {
        // do something so simple
    }

it takes about 40 micro second. No need to say when the number of string values exceeds a few thousands, it'll be too slow.

大约需要 40 微秒。不用说,当字符串值超过几千的时候,就太慢了。

The pattern is something like:

模式是这样的:

    "A*B*C*D*E*F*"

where A~F are just examples here, but the pattern is some thing like the above. Please note* that the pattern actually changes per search. For example "A*B*C*" may change to W*D*G*A*".

其中 A~F 只是这里的示例,但模式与上述类似。请注意* 每次搜索时模式实际上都会发生变化。例如,"A*B*C*" 可能会更改为 W*D*G*A*"。

I wonder if there is a better substitution for the above pattern or, more generally, an alternative for java regular expressions.

我想知道是否有更好的替代上述模式,或者更一般地说,是 java 正则表达式的替代方案。

采纳答案by Seelenvirtuose

Regular expressions in Java are compiled into an internal data structure. This compilation is the time-consuming process. Each time you invoke the method String.matches(String regex), the specified regular expression is compiled again.

Java 中的正则表达式被编译成内部数据结构。这个编译是一个耗时的过程。每次调用该方法时String.matches(String regex),都会重新编译指定的正则表达式。

So you should compile your regular expression only once and reuse it:

所以你应该只编译你的正则表达式一次并重用它:

Pattern pattern = Pattern.compile(regexPattern);
for(String value : values) {
    Matcher matcher = pattern.matcher(value);
    if (matcher.matches()) {
        // your code here
    }
}

回答by Jason C

Consider the following (quick and dirty) test:

考虑以下(快速而肮脏的)测试:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test3 {

    // time that tick() was called
    static long tickTime;

    // called at start of operation, for timing
    static void tick () {
        tickTime = System.nanoTime();
    }

    // called at end of operation, prints message and time since tick().
    static void tock (String action) {
        long mstime = (System.nanoTime() - tickTime) / 1000000;
        System.out.println(action + ": " + mstime + "ms");
    }

    // generate random strings of form AAAABBBCCCCC; a random 
    // number of characters each randomly repeated.
    static List<String> generateData (int itemCount) {

        Random random = new Random();
        List<String> items = new ArrayList<String>();
        long mean = 0;

        for (int n = 0; n < itemCount; ++ n) {
            StringBuilder s = new StringBuilder();
            int characters = random.nextInt(7) + 1;
            for (int k = 0; k < characters; ++ k) {
                char c = (char)(random.nextInt('Z' - 'A') + 'A');
                int rep = random.nextInt(95) + 5;
                for (int j = 0; j < rep; ++ j)
                    s.append(c);
                mean += rep;
            }
            items.add(s.toString());
        }

        mean /= itemCount;
        System.out.println("generated data, average length: " + mean);

        return items;

    }

    // match all strings in items to regexStr, do not precompile.
    static void regexTestUncompiled (List<String> items, String regexStr) {

        tick();

        int matched = 0, unmatched = 0;

        for (String item:items) {
            if (item.matches(regexStr))
                ++ matched;
            else
                ++ unmatched;
        }

        tock("uncompiled: regex=" + regexStr + " matched=" + matched + 
             " unmatched=" + unmatched);

    }

    // match all strings in items to regexStr, precompile.
    static void regexTestCompiled (List<String> items, String regexStr) {

        tick();

        Matcher matcher = Pattern.compile(regexStr).matcher("");
        int matched = 0, unmatched = 0;

        for (String item:items) {
            if (matcher.reset(item).matches())
                ++ matched;
            else
                ++ unmatched;
        }

        tock("compiled: regex=" + regexStr + " matched=" + matched + 
             " unmatched=" + unmatched);

    }

    // test all strings in items against regexStr.
    static void regexTest (List<String> items, String regexStr) {

        regexTestUncompiled(items, regexStr);
        regexTestCompiled(items, regexStr);

    }

    // generate data and run some basic tests
    public static void main (String[] args) {

        List<String> items = generateData(1000000);
        regexTest(items, "A*");
        regexTest(items, "A*B*C*");
        regexTest(items, "E*C*W*F*");

    }

}

Strings are random sequences of 1-8 characters with each character occurring 5-100 consecutive times (e.g. "AAAAAAGGGGGDDFFFFFF"). I guessed based on your expressions.

字符串是 1-8 个字符的随机序列,每个字符连续出现 5-100 次(例如“AAAAAAGGGGGDDFFFFFF”)。我是根据你的表情猜的。

Granted this might not be representative of your data set, but the timing estimates for applying those regular expressions to 1 million randomly generates strings of average length 208 each on my modest 2.3 GHz dual-core i5 was:

当然,这可能不能代表您的数据集,但将这些正则表达式应用于 100 万个随机生成平均长度为 208 的字符串的时间估计在我适度的 2.3 GHz 双核 i5 上是:

Regex      Uncompiled    Precompiled
A*          0.564 sec     0.126 sec
A*B*C*      1.768 sec     0.238 sec
E*C*W*F*    0.795 sec     0.275 sec

Actual output:

实际输出:

generated data, average length: 208
uncompiled: regex=A* matched=6004 unmatched=993996: 564ms
compiled: regex=A* matched=6004 unmatched=993996: 126ms
uncompiled: regex=A*B*C* matched=18677 unmatched=981323: 1768ms
compiled: regex=A*B*C* matched=18677 unmatched=981323: 238ms
uncompiled: regex=E*C*W*F* matched=25495 unmatched=974505: 795ms
compiled: regex=E*C*W*F* matched=25495 unmatched=974505: 275ms

Even without the speedup of precompiled expressions, and even considering that the results vary wildly depending on the data set and regular expression (and even considering that I broke a basic rule of proper Java performance tests and forgot to prime HotSpot first), this is very fast, and I still wonder if the bottleneck is truly where you think it is.

即使没有预编译表达式的加速,甚至考虑到结果因数据集和正则表达式而异(甚至考虑到我违反了正确的 Java 性能测试的基本规则并忘记首先准备 HotSpot),这也是非常很快,我仍然想知道瓶颈是否真的在你认为的地方。

After switching to precompiled expressions, if you still are not meeting your actual performance requirements, do some profiling. If you find your bottleneck is still in your search, consider implementing a more optimized search algorithm.

切换到预编译表达式后,如果您仍然不能满足您的实际性能要求,请进行一些分析。如果您发现瓶颈仍在搜索中,请考虑实施更优化的搜索算法。

For example, assuming your data set is like my test set above: If your data set is known ahead of time, reduce each item in it to a smaller string key by removing repetitive characters, e.g. for "AAAAAAABBBBCCCCCCC", store it in a map of some sort keyed by "ABC". When a user searches for "ABC*" (presuming your regex's are in that particular form), look for "ABC" items. Or whatever. It highly depends on your scenario.

例如,假设您的数据集类似于我上面的测试集:如果您的数据集提前已知,则通过删除重复字符将其中的每个项目减少为较小的字符串键,例如“AAAAAAABBBBCCCCCCC”,将其存储在映射中以“ABC”为键的某种类型。当用户搜索“A BC*”(假设您的正则表达式采用该特定形式)时,请查找“ABC”项目。管他呢。这在很大程度上取决于您的场景。