C# \d is less efficient than [0-9]

Disclaimer: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/16621738/

Time: 2020-08-10 01:30:10  Source: igfitidea

c# regex performance

Asked by weston

I made a comment yesterday on an answer where someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said it was probably more efficient to use a range or digit specifier than a character set.

I decided to test that out today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which don't seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters, 5077 of which actually contain a digit:

Regular expression \d           took 00:00:00.2141226 result: 5077/10000
Regular expression [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regular expression [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

It's a surprise to me for two reasons:

  1. I would have thought the range would be implemented much more efficiently than the set.
  2. I can't understand why \d is worse than [0-9]. Is there more to \d than simply shorthand for [0-9]?

Here is the test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //Generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //Add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //In roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //Replace one character with a digit, 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();

            int successes = 0;

            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}

Accepted answer by Sina Iravanian

\d checks all Unicode digits, while [0-9] is limited to these 10 characters. For example, Persian digits, ۱۲۳۴۵۶۷۸۹, are Unicode digits which are matched by \d, but not by [0-9].
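The difference described above can be checked directly. The following is a minimal sketch, not part of the original answer: the class name DigitMatchDemo and the helper IsAllDigits are hypothetical names introduced only for illustration. It tests an ASCII string and a string of Persian digits against both patterns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitMatchDemo
{
    // Hypothetical helper: true when every character of the input
    // is matched by the given single-character digit pattern.
    public static bool IsAllDigits(string input, string digitPattern)
    {
        return Regex.IsMatch(input, "^(?:" + digitPattern + ")+$");
    }

    public static void Main()
    {
        string ascii = "2013";
        // Extended Arabic-Indic (Persian) digits for 2, 0, 1, 3.
        string persian = "\u06F2\u06F0\u06F1\u06F3";

        Console.WriteLine(IsAllDigits(ascii, @"\d"));     // True
        Console.WriteLine(IsAllDigits(ascii, "[0-9]"));   // True
        Console.WriteLine(IsAllDigits(persian, @"\d"));   // True
        Console.WriteLine(IsAllDigits(persian, "[0-9]")); // False
    }
}
```

The extra category lookup that \d needs for every character is a plausible source of the timing gap in the question, although this sketch does not measure it.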

You can generate a list of all such characters using the following code:

var sb = new StringBuilder();
for(UInt16 i = 0; i < UInt16.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());

Which generates:

0123456789??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????0123456789
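As a cross-check of that list, \d in .NET is documented as equivalent to \p{Nd}, and char.IsDigit tests the same DecimalDigitNumber category, so the two should agree on every BMP character. The sketch below is an illustration under that assumption; BmpDigitCheck and CountDisagreements are hypothetical names, not from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BmpDigitCheck
{
    // Counts the UTF-16 code units on which the regex \d class
    // and char.IsDigit give different answers.
    public static int CountDisagreements()
    {
        var digit = new Regex(@"\d");
        int disagreements = 0;
        for (int i = 0; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (digit.IsMatch(c.ToString()) != char.IsDigit(c))
                disagreements++;
        }
        return disagreements;
    }

    public static void Main()
    {
        Console.WriteLine(CountDisagreements());
    }
}
```

If the count is zero, char.IsDigit can stand in for the regex when all that is needed is a per-character digit test.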

Answer by İsmet Alkan

From Does “\d” in regex mean a digit?:

[0-9] isn't equivalent to \d. [0-9] matches only the characters 0123456789, while \d matches [0-9] and other digit characters, for example Eastern Arabic numerals ٠١٢٣٤٥٦٧٨٩

Answer by weston

Credit to ByteBlast for noticing this in the docs. Just changing the regex constructor:

var rex = new Regex(regex, RegexOptions.ECMAScript);

Gives new timings:

Regex \d           took 00:00:00.1355787 result: 5077/10000
Regex [0-9]        took 00:00:00.1360403 result: 5077/10000  100.34 % of first
Regex [0123456789] took 00:00:00.1362112 result: 5077/10000  100.47 % of first
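The timings converge because RegexOptions.ECMAScript restricts \d to the ASCII digits, which changes what the pattern matches as well as how fast it runs. A minimal sketch of that behavioral change follows; EcmaScriptDigitDemo and MatchesDigit are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaScriptDigitDemo
{
    // Hypothetical helper: does \d find a digit in the input under the given options?
    public static bool MatchesDigit(string input, RegexOptions options)
    {
        return new Regex(@"\d", options).IsMatch(input);
    }

    public static void Main()
    {
        string persianFive = "\u06F5"; // EXTENDED ARABIC-INDIC DIGIT FIVE

        Console.WriteLine(MatchesDigit(persianFive, RegexOptions.None));       // True
        Console.WriteLine(MatchesDigit(persianFive, RegexOptions.ECMAScript)); // False
        Console.WriteLine(MatchesDigit("5", RegexOptions.ECMAScript));         // True
    }
}
```

So the option is only a safe speed-up when matching non-ASCII digits was never intended in the first place.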

Answer by Sebastian

In addition to the top answer from Sina Iravanian, here is a .NET 4.5 version of his code (since only that version supports UTF-16 output; cf. the first three lines), using the full range of Unicode code points. Because proper support for the higher Unicode planes is often lacking, many people are not aware that they should always check for and include them. Nevertheless, those planes sometimes do contain important characters.

Update

Since \d does not support non-BMP characters in regex (thanks xanatos), here is a version that uses the Unicode character database:

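using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class Program
{
    public static void Main()
    {
        // Switch the console to UTF-16 so non-ASCII digits can be printed.
        var unicodeEncoding = new UnicodeEncoding(!BitConverter.IsLittleEndian, false);
        Console.InputEncoding = unicodeEncoding;
        Console.OutputEncoding = unicodeEncoding;

        var categories = new[]
        {
            UnicodeCategory.DecimalDigitNumber,
            UnicodeCategory.LetterNumber,
            UnicodeCategory.OtherNumber
        };

        // One bucket of characters per numeric category.
        var charInfo = new Dictionary<UnicodeCategory, StringBuilder>();
        foreach (var category in categories)
            charInfo[category] = new StringBuilder();

        for (var codePoint = 0; codePoint <= 0x10ffff; codePoint++)
        {
            // Surrogate code points are not characters; ConvertFromUtf32 rejects them.
            var isSurrogateCodePoint = codePoint <= UInt16.MaxValue
                   && (  char.IsLowSurrogate((char) codePoint)
                      || char.IsHighSurrogate((char) codePoint)
                      );

            if (isSurrogateCodePoint)
                continue;

            var codePointString = char.ConvertFromUtf32(codePoint);

            // The (string, index) overload also classifies surrogate pairs.
            var codePointCategory = CharUnicodeInfo.GetUnicodeCategory(codePointString, 0);
            if (charInfo.TryGetValue(codePointCategory, out var bucket))
                bucket.Append(codePointString);
        }

        var sb = new StringBuilder();
        foreach (var category in categories)
        {
            sb.AppendLine($"{category}");
            sb.AppendLine(charInfo[category].ToString());
        }
        Console.WriteLine(sb.ToString());

        Console.ReadKey();
    }
}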

Yielding the following output:

DecimalDigitNumber 0123456789??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????0123456789

LetterNumber

???ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫ????ⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹ?????????????〇〡〢〣〤〥〦〧〨〩?????????????

OtherNumber 231??????????????????????????????????????????????????????????????????????????????????????????????????????????①②③④⑤⑥⑦⑧⑨⑩??????????⑴⑵⑶⑷⑸⑹⑺⑻⑼⑽⑾⑿⒀⒁⒂⒃⒄⒅⒆⒇⒈⒉⒊⒋⒌⒍⒎⒏⒐⒑⒒⒓⒔⒕⒖⒗⒘⒙⒚⒛?????????????????????????????????????????????????????一二三四㈠㈡㈢㈣㈤㈥㈦㈧㈨㈩???????????????????????一二三四五六七八九十?????????????????????
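The non-BMP limitation mentioned in the update can be probed with a single supplementary-plane digit. This is a sketch under stated assumptions: NonBmpDigitDemo and its two helpers are hypothetical names, and the example relies on U+1D7CE (MATHEMATICAL BOLD DIGIT ZERO) being a decimal digit stored as a surrogate pair.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NonBmpDigitDemo
{
    // Hypothetical helper: Unicode category of the code point starting at index 0.
    public static UnicodeCategory CategoryOf(string text)
    {
        return CharUnicodeInfo.GetUnicodeCategory(text, 0);
    }

    // Hypothetical helper: does the regex \d class match anywhere in the text?
    public static bool RegexSeesDigit(string text)
    {
        return Regex.IsMatch(text, @"\d");
    }

    public static void Main()
    {
        // U+1D7CE lies outside the BMP, so it occupies two UTF-16 code units.
        string boldZero = char.ConvertFromUtf32(0x1D7CE);

        Console.WriteLine(boldZero.Length);          // 2
        Console.WriteLine(CategoryOf(boldZero));     // category from the Unicode database
        Console.WriteLine(RegexSeesDigit(boldZero)); // regex result for the same character
    }
}
```

The category lookup is expected to report DecimalDigitNumber while the regex reports no match, because the .NET regex engine examines each surrogate code unit separately.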

Answer by dengkai

\d checks all Unicode digits, while [0-9] is limited to these 10 characters. If you only need to match those 10 digits, you should use [0-9]. Otherwise I recommend \d, because it is shorter to write.