bash: Implications of LC_ALL=C for speeding up grep
Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/8138124/
Implications of LC_ALL=C to speedup grep
Asked by elhoim
I just discovered that if I prefix my grep commands with LC_ALL=C, it does wonders for speeding grep up.
But I am wondering about the implications.
Would a pattern using UTF-8 not match? What happens if the grepped file is using UTF-8?
Answered by thiton
You don't necessarily need UTF-8 to run into trouble here. The locale is responsible for setting the character classes, i.e. determining which character is a space, a letter or a digit. Consider these two examples:
$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep '[[:alnum:]]' || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep '[[:alnum:]]' || echo false
false
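The UTF-8 case from the question can be sketched the same way. This is only an illustration, assuming a POSIX-style shell and a grep that honors LC_ALL (GNU grep does); the helper name count_alnum_lines is made up for this example, and the UTF-8 locale is looked up at run time because its name differs between systems. In the C locale the two bytes of a UTF-8 'ä' are not alphanumeric, so nothing is counted; what a UTF-8 locale reports depends on the locale data installed.

```shell
# Hypothetical helper: count input lines that contain an alphanumeric
# character under the given locale. grep exits 1 when nothing matches,
# so that status is ignored and only the printed count is used.
count_alnum_lines() {
    LC_ALL=$1 grep -c '[[:alnum:]]' || true
}

# 'ä' in UTF-8 is the two bytes 0xC3 0xA4, written here as octal escapes.
c_count=$(printf '\303\244\n' | count_alnum_lines C)
echo "C locale: $c_count"   # C locale: 0

# Try the same input under a UTF-8 locale, if one is installed.
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8$' | head -n 1 || true)
if [ -n "$utf8_locale" ]; then
    u_count=$(printf '\303\244\n' | count_alnum_lines "$utf8_locale")
    echo "$utf8_locale: $u_count"
fi
```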
When trying to match exact binary patterns against each other, the locale doesn't make a difference, however:
$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep "$(echo -e '\xe4')" || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep "$(echo -e '\xe4')" || echo false
ä
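For UTF-8 data, the byte-for-byte behaviour looks like this. It is a rough sketch, assuming a reasonably recent GNU grep; the variable names are illustrative only. A literal pattern still matches in the C locale because the bytes are identical, whereas -i cannot pair 'Ä' with 'ä' there, since case folding in the C locale covers ASCII letters only.

```shell
# UTF-8 byte sequences written as octal escapes so the script stays ASCII.
lower=$(printf '\303\244')   # ä (U+00E4)
upper=$(printf '\303\204')   # Ä (U+00C4)

# Literal match in the C locale: identical bytes, so the line is counted.
exact=$(printf '%s\n' "$lower" | LC_ALL=C grep -c "$lower" || true)

# Case-insensitive match in the C locale: only ASCII letters are folded,
# so the upper-case line is not counted.
folded=$(printf '%s\n' "$upper" | LC_ALL=C grep -ic "$lower" || true)

echo "exact=$exact folded=$folded"   # exact=1 folded=0
```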
I'm not sure to what extent grep implements Unicode, or how well different codepoints are matched to each other, but matching any subset of ASCII, and matching single characters that have no alternate binary representation, should work fine regardless of locale.
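One way to apply this in practice is to switch to the C locale only for plain ASCII fixed-string searches. Byte-wise matching is safe there even on UTF-8 input, because ASCII bytes never occur inside a multibyte UTF-8 sequence. The function below is a hypothetical helper (ascii_fgrep is not a standard command), sketched under the assumption of a shell that supports local, such as bash:

```shell
# Hypothetical helper: fixed-string grep that uses the C locale only when
# the pattern is pure ASCII. Usage: ascii_fgrep PATTERN [grep options/files]
ascii_fgrep() {
    local pattern=$1 rest
    shift
    # Strip all ASCII bytes from the pattern; leftovers mean non-ASCII content.
    rest=$(printf '%s' "$pattern" | LC_ALL=C tr -d '\000-\177')
    if [ -z "$rest" ]; then
        LC_ALL=C grep -F -e "$pattern" "$@"
    else
        grep -F -e "$pattern" "$@"
    fi
}

# ASCII needle in a line that also contains UTF-8 text ("naïve word").
hits=$(printf 'na\303\257ve word\n' | ascii_fgrep word -c || true)
echo "hits=$hits"   # hits=1
```

Patterns that use ., bracket expressions, character classes or -i are deliberately left out of this shortcut, since those are the features whose meaning changes with the locale.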

