Is regex case insensitivity slower?
Date: 2020-03-05 18:44:18 Source: igfitidea
Source
RegexOptions.IgnoreCase is more expensive than I would have thought (eg, should be barely measurable)
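Assuming this applies to PHP, Python, Perl, Ruby, etc. as well as C# (which I assume is what Jeff was using), how much of a slowdown is it, and will I incur a similar penalty with /[a-zA-Z]/ as I would with /[a-z]/i?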
Solutions
Answer
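Yes, [A-Za-z] will be much faster than setting RegexOptions.IgnoreCase, largely because of Unicode strings. But it is also much more limiting: [A-Za-z] does not match accented international characters. It is literally the A-Za-z ASCII set and nothing more.
I don't know if you saw Tim Bray's reply to my message, but it's a good one: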
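To make the Unicode point concrete, here is a minimal Python sketch (Python's re module, not the .NET engine the answer refers to, so treat it as an illustration rather than a statement about RegexOptions.IgnoreCase). It shows that [A-Za-z] rejects an accented letter, and that a case-insensitive [a-z] on a Unicode pattern quietly accepts a few non-ASCII characters that fold to ASCII letters.

```python
import re

# Plain ASCII class: accented letters are simply not in the set.
ascii_letters = re.compile(r"[A-Za-z]+")
matches_cafe = ascii_letters.fullmatch("cafe") is not None
matches_cafe_accent = ascii_letters.fullmatch("caf\u00e9") is not None  # "café"

# Case-insensitive [a-z]: on str patterns Python applies Unicode case rules,
# so a few non-ASCII characters that case-map to ASCII letters also match.
unicode_nocase = re.compile(r"[a-z]", re.IGNORECASE)
ascii_nocase = re.compile(r"[a-z]", re.IGNORECASE | re.ASCII)

# U+0130 capital I with dot, U+0131 dotless i, U+017F long s, U+212A Kelvin sign
extras = ["\u0130", "\u0131", "\u017f", "\u212a"]
unicode_hits = [c for c in extras if unicode_nocase.fullmatch(c)]
ascii_hits = [c for c in extras if ascii_nocase.fullmatch(c)]

print(matches_cafe, matches_cafe_accent)
print(len(unicode_hits), len(ascii_hits))
```

So the two patterns from the question are not strictly interchangeable once Unicode input is involved, and the engine has extra case-mapping work to do in the case-insensitive version.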
One of the trickiest issues in internationalized search is upper and lower case. This notion of case is limited to languages written in the Latin, Greek, and Cyrillic character sets. English-speakers naturally expect search to be case-insensitive if only because they’re lazy: if Nadia Jones wants to look herself up on Google she’ll probably just type in nadia jones and expect the system to take care of it. So it’s fairly common for search systems to “normalize” words by converting them all to lower- or upper-case, both for indexing and queries. The trouble is that the mapping between cases is not always as straightforward as it is in English. For example, the German lower-case character “ß” becomes “SS” when upper-cased, and good old capital “I” when down-cased in Turkish becomes the dotless “ı” (yes, they have “i”, its upper-case version is “İ”). I have read (but not verified first-hand) that the rules for upcasing accented characters such “é” are different in France and Québec. One of the results of all this is that software such as java.String.toLowerCase() tends to run astonishingly slow as it tries to work around all these corner-cases.
http://www.tbray.org/ongoing/When/200x/2003/10/11/SearchI18n
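The corner cases in the quote are easy to reproduce. A small Python sketch follows. Note that Python's str methods are not locale-aware, so the Turkish rule for "I" is not applied here. Using casefold() for comparison is my suggestion, not something from the quoted post.

```python
# Case mapping is not always one character to one character.
sharp_s_upper = "\u00df".upper()        # German sharp s
dotted_i_lower = "\u0130".lower()       # capital I with dot above
plain_i_lower = "I".lower()             # locale-independent, so not the Turkish dotless i

print(sharp_s_upper)                    # SS
print(len(dotted_i_lower))              # 2: "i" followed by a combining dot above
print(plain_i_lower)                    # i

# lower() is therefore not a reliable way to compare caselessly;
# casefold() applies the fuller Unicode folding rules.
a, b = "STRASSE", "stra\u00dfe"
same_with_lower = a.lower() == b.lower()
same_with_casefold = a.casefold() == b.casefold()
print(same_with_lower, same_with_casefold)   # False True
```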
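Answer
If you can tolerate digits and underscores in that regex, you can use, for example, the \w character class (Perl syntax). I believe some engines support [:alpha:], but that is not pure Perl. \w takes the current locale into account and matches both upper- and lower-case letters, and I bet it is faster than using [A-Z] while ignoring case.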
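For comparison, here is roughly how \w behaves in Python. This is a sketch under the assumption of Python 3 str patterns, where \w follows Unicode rather than the locale described above for Perl. The [^\W\d_] idiom for "letters only" is my addition, not part of the answer.

```python
import re

sample = "na\u00efve_user 42 \u00c4rger"   # "naïve_user 42 Ärger"

# \w matches letters (including accented ones), digits and underscore.
word_runs = re.findall(r"\w+", sample)

# Letters only: a word character that is neither a digit nor an underscore.
letter_runs = re.findall(r"[^\W\d_]+", sample)

# With re.ASCII, \w shrinks back to [a-zA-Z0-9_].
ascii_runs = re.findall(r"\w+", sample, re.ASCII)

print(word_runs)    # ['naïve_user', '42', 'Ärger']
print(letter_runs)  # ['naïve', 'user', 'Ärger']
print(ascii_runs)   # ['na', 've_user', '42', 'rger']
```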
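Answer
If you are worried about this, it may be worthwhile to convert the string to all upper case or all lower case before you check.
For instance, in Perl: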
$x = "abbCCDGBAdgfabv"; (lc $x) =~ /bad/;
may in some cases be faster than
$x = "abbCCDGBAdgfabv"; $x =~ /bad/i;
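If you want to measure this rather than guess, a rough Python equivalent of the two Perl lines above can be timed with timeit. This is only a sketch: the function names and iteration count are arbitrary choices of mine, the numbers will vary by engine, input, and machine, and lower() is only a safe substitute for case-insensitive matching on ASCII-like data.

```python
import re
import timeit

text = "abbCCDGBAdgfabv"
plain = re.compile(r"bad")
nocase = re.compile(r"bad", re.IGNORECASE)

def lowered_search():
    # Normalize the input once, then run a case-sensitive match.
    return plain.search(text.lower()) is not None

def ignorecase_search():
    # Let the regex engine handle case folding itself.
    return nocase.search(text) is not None

lowered_time = timeit.timeit(lowered_search, number=100_000)
ignorecase_time = timeit.timeit(ignorecase_search, number=100_000)

print(f"lower() + /bad/ : {lowered_time:.4f}s")
print(f"/bad/i          : {ignorecase_time:.4f}s")
```

Both functions return the same answer for this input, so any difference you see is purely in cost.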