How does RegexOptions.Compiled work in .NET?

Question

Asked by Bob

What is going on behind the scenes when you mark a regular expression as one to be compiled? How does this differ from a cached regular expression?

Using this information, how do you determine when the cost of computation is negligible compared to the performance increase?

Answer 1

Answered by Sam Saffron

RegexOptions.Compiled instructs the regular expression engine to compile the regular expression into IL using lightweight code generation (LCG). This compilation happens during construction of the Regex object and slows construction down heavily. In return, matches using the regular expression are faster.

If you do not specify this flag, your regular expression is considered "interpreted".

Take this example:

// Requires: using System; using System.Diagnostics; using System.Text.RegularExpressions;
public static void TimeAction(string description, int times, Action func)
{
    // warmup
    func();

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < times; i++)
    {
        func();
    }
    watch.Stop();
    Console.Write(description);
    Console.WriteLine(" Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}

static void Main(string[] args)
{
    var simple = @"^\d+$";
    var medium = @"^((to|from)\W)?(?<url>http://[\w\.:]+)/questions/(?<questionId>\d+)(/(\w|-)*)?(/(?<answerId>\d+))?";
    var complex = @"^(([^<>()[\]\.,;:\s@""]+"
      + @"(\.[^<>()[\]\.,;:\s@""]+)*)|("".+""))@"
      + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
      + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
      + @"[a-zA-Z]{2,}))$";


    string[] numbers = new string[] {"1","two", "8378373", "38737", "3873783z"};
    string[] emails = new string[] { "[email protected]", "sss@s", "[email protected]", "[email protected]" };

    foreach (var item in new[] {
        new {Pattern = simple, Matches = numbers, Name = "Simple number match"},
        new {Pattern = medium, Matches = emails, Name = "Simple email match"},
        new {Pattern = complex, Matches = emails, Name = "Complex email match"}
    })
    {
        int i = 0;
        Regex regex;

        TimeAction(item.Name + " interpreted uncached single match (x1000)", 1000, () =>
        {
            regex = new Regex(item.Pattern);
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        i = 0;
        TimeAction(item.Name + " compiled uncached single match (x1000)", 1000, () =>
        {
            regex = new Regex(item.Pattern, RegexOptions.Compiled);
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        regex = new Regex(item.Pattern);
        i = 0;
        TimeAction(item.Name + " prepared interpreted match (x1000000)", 1000000, () =>
        {
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        regex = new Regex(item.Pattern, RegexOptions.Compiled);
        i = 0;
        TimeAction(item.Name + " prepared compiled match (x1000000)", 1000000, () =>
        {
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

    }
}

It performs 4 tests on 3 different regular expressions. First it tests a single one-off match (compiled vs. non-compiled), constructing a new Regex each time. Second it tests repeated matches that reuse the same Regex object.

The results on my machine (compiled in release mode, no debugger attached):

1000 single matches (construct Regex, Match and dispose)

Type        | Platform | Trivial Number | Simple Email Check | Ext Email Check
------------------------------------------------------------------------------
Interpreted | x86      |    4 ms        |    26 ms           |    31 ms
Interpreted | x64      |    5 ms        |    29 ms           |    35 ms
Compiled    | x86      |  913 ms        |  3775 ms           |  4487 ms
Compiled    | x64      | 3300 ms        | 21985 ms           | 22793 ms

1,000,000 matches - reusing the Regex object

Type        | Platform | Trivial Number | Simple Email Check | Ext Email Check
------------------------------------------------------------------------------
Interpreted | x86      |  422 ms        |   461 ms           |  2122 ms
Interpreted | x64      |  436 ms        |   463 ms           |  2167 ms
Compiled    | x86      |  279 ms        |   166 ms           |  1268 ms
Compiled    | x64      |  281 ms        |   176 ms           |  1180 ms

These results show that compiled regular expressions can be up to about 60% faster in cases where you reuse the Regex object. However, in some cases they can be several hundred times slower to construct, approaching 3 orders of magnitude (for example 21985 ms vs. 29 ms on x64).

It also shows that the x64 version of .NET can be roughly 4 to 6 times slower than x86 when it comes to compiling regular expressions.

The recommendation would be to use the compiled version in cases where either:

You do not care about object initialization cost and need the extra performance boost. (note we are talking fractions of a millisecond here)
You care a little bit about initialization cost, but are reusing the Regex object so many times that it will compensate for it during your application life cycle.
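
To make the second point concrete, here is a rough break-even estimate built only from the x86 "Trivial Number" figures in the two tables above. The class and method names are hypothetical, and the result only holds for that pattern on that machine, so treat it as a sketch of the arithmetic rather than a general rule.

```csharp
using System;

public static class BreakEven
{
    // Hypothetical helper: estimates how many matches are needed before the
    // extra construction cost of a compiled Regex is paid back.
    public static double MatchesToBreakEven(
        double interpretedConstructMs, double compiledConstructMs, int constructions,
        double interpretedMatchMs, double compiledMatchMs, int matches)
    {
        // Extra cost of constructing one compiled Regex, in ms.
        double extraPerConstruction =
            (compiledConstructMs - interpretedConstructMs) / constructions;
        // Time saved by each match on a compiled Regex, in ms.
        double savedPerMatch = (interpretedMatchMs - compiledMatchMs) / matches;
        return extraPerConstruction / savedPerMatch;
    }

    public static void Main()
    {
        // x86 "Trivial Number" column: 4 ms vs 913 ms for 1000 constructions,
        // 422 ms vs 279 ms for 1,000,000 matches.
        double n = MatchesToBreakEven(4, 913, 1000, 422, 279, 1000000);
        Console.WriteLine("Break-even after about {0:F0} matches", n);
    }
}
```

With these inputs the estimate comes out at several thousand matches per Regex instance, which is why compilation only pays off for objects that are reused heavily.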

Spanner in the works, the Regex cache

The regular expression engine contains an LRU cache which holds the last 15 regular expressions that were tested using the static methods on the Regex class.

For example, Regex.Replace, Regex.Match, etc. all use the Regex cache.

The size of the cache can be increased by setting Regex.CacheSize. It accepts changes in size any time during your application's life cycle.
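
A minimal sketch of adjusting the cache size. Regex.CacheSize is the real static property; the class name, the Configure helper, and the value 50 are illustrative assumptions only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CacheSizeDemo
{
    // Hypothetical helper: sets the static Regex cache size and returns
    // the value now in effect.
    public static int Configure(int newSize)
    {
        Regex.CacheSize = newSize;
        return Regex.CacheSize;
    }

    public static void Main()
    {
        Console.WriteLine("Cache size before: {0}", Regex.CacheSize);
        Console.WriteLine("Cache size after:  {0}", Configure(50));

        // Static helper calls such as this one go through the cache.
        bool isNumber = Regex.IsMatch("10000", @"^\d+$");
        Console.WriteLine("Is number: {0}", isNumber);
    }
}
```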

Only the static helpers on the Regex class add new regular expressions to the cache. If you construct a Regex object yourself, the cache is checked (so an existing entry can be reused and bumped), but the regular expression you construct is not added to the cache.

This cache is a trivial LRU cache, implemented using a simple doubly linked list. If you happen to increase its size to 5000 and make 5000 different calls on the static helpers, every regular expression construction will crawl the 5000 entries to see whether the pattern has previously been cached. There is a lock around the check, so the check can decrease parallelism and introduce thread blocking.

The number is set quite low to protect yourself from cases like this, though in some cases you may have no choice but to increase it.

My strong recommendation would be to never pass the RegexOptions.Compiled option to a static helper.

For example:

// WARNING: bad code
Regex.IsMatch("10000", @"\d+", RegexOptions.Compiled);

The reason is that you heavily risk a miss on the LRU cache, which will trigger a very expensive compile. Additionally, you have no idea what the libraries you depend on are doing, so you have little ability to control or predict the best possible size of the cache.
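
One way to avoid the static-helper problem is to hold the compiled Regex in a static field, so it is constructed once and never depends on the cache. This is only a sketch; the class and member names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberValidator
{
    // Constructed (and compiled) once, when the type is first used.
    // The compilation cost is paid here, not on every call, and no
    // cache lookup is involved.
    private static readonly Regex Digits =
        new Regex(@"^\d+$", RegexOptions.Compiled);

    // Each call reuses the same compiled instance.
    public static bool IsNumber(string input)
    {
        return Digits.IsMatch(input);
    }

    public static void Main()
    {
        foreach (var s in new[] { "1", "two", "8378373", "3873783z" })
        {
            Console.WriteLine("{0}: {1}", s, IsNumber(s));
        }
    }
}
```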

Answer 2

Answered by Tomalak

This entry in the BCL Team Blog gives a nice overview: "Regular Expression performance".

In short, there are three types of regex (each executing faster than the previous one):

1) interpreted: fast to create on the fly, slow to execute
2) compiled (the one you seem to ask about): slower to create on the fly, fast to execute (good for execution in loops)
3) pre-compiled: created at compile time of your app (no run-time creation penalty), fast to execute

So, if you intend to execute the regex only once, or in a non-performance-critical section of your app (e.g. user input validation), you are fine with option 1.

If you intend to run the regex in a loop (e.g. line-by-line parsing of a file), you should go with option 2.

If you have many regexes that will never change for your app and are used intensely, you could go with option 3.

Answer 3

Answered by Robert Paulson

It should be noted that the performance of regular expressions since .NET 2.0 has been improved with an MRU cache of uncompiled regular expressions. The Regex library code no longer reinterprets the same un-compiled regular expression every time.

So there is potentially a bigger performance penalty with a regular expression that is compiled on the fly. In addition to slower load times, the system also uses more memory to compile the regular expression to opcodes.

Essentially, the current advice is either not to compile a regular expression at all, or to compile it in advance into a separate assembly.

Ref: BCL Team Blog Regular Expression performance [David Gutierrez]

Answer 4

Answered by Rob McCready

1) Base Class Library Team on compiled regex

2) Coding Horror, referencing #1 with some good points on the tradeoffs

Answer 5

Answered by Daniel Muthupandi

I hope the code below helps you understand the concept of the re.compile function (note that this example uses Python's re module, not .NET):

import re

x="""101 COM    Computers
205 MAT   Mathematics
189 ENG   English
222 SCI Science
333 TA  Tamil
5555 KA  Kannada
6666  TL  Telugu
777777 FR French
"""

# Compile the regular expression; a compiled pattern can be reused by any re function.
find_subject_code = re.compile(r"\d+", re.M)
# Using the compiled regex, way 1: call the method on the pattern object
out = find_subject_code.findall(x)
print(out)
# Using the compiled regex, way 2: pass the pattern object to the module-level function
out = re.findall(find_subject_code, x)
print(out)

# A few more examples:
# Find the subject name (the last word on each line)
find_subjectnames = re.compile(r"(\w+$)", re.M)
out = find_subjectnames.findall(x)
print(out)


# Find the subject SHORT name (2 to 3 capital letters)
find_subject_short_names = re.compile(r"[A-Z]{2,3}", re.M)
out = find_subject_short_names.findall(x)
print(out)
