Linux/UNIX sort ignores whitespace
原文地址: http://stackoverflow.com/questions/6923464/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
UNIX sort ignores whitespaces
Asked by dagnelies
Given a file txt:
ab
a c
a a
When calling sort txt, I obtain:
a a
ab
a c
In other words, this is not a proper sort: it seems to delete or ignore the whitespace! I expected this to be the behavior of sort -i, but it happens with or without the -i flag.
I would like to obtain "correct" sorting:
a a
a c
ab
How should I do that?
Accepted answer by dagnelies
Solved by:
export LC_ALL=C
WARNING: The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
(Works for ASCII at least; no idea about UTF-8.)
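A minimal sketch of the same fix scoped to a single command instead of the whole session, assuming a POSIX shell. The variable name sorted is made up for illustration. Prefixing the assignment sets LC_ALL only for that one sort process:

```shell
# Feed the sample lines from the question to sort.
# LC_ALL=C applies only to this sort process; the shell's own locale is untouched.
sorted=$(printf 'ab\na c\na a\n' | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

In the C locale the space (byte 0x20) compares lower than any letter, so "a a" and "a c" come before "ab".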
Answer by Ray Toal
Actually, for me:
$ cat txt
ab
a c
a a
$ sort txt
a a
a c
ab
I'll bet that between your a and c you have a non-breaking space, an en space, an em space, or some other high-codepoint space!
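One way to test a guess like this is to dump the bytes of the line, sketched here with od under the assumption that the file is UTF-8 encoded. The variable names plain and nbsp are made up for illustration:

```shell
# An ordinary ASCII space shows up as the single byte 20.
plain=$(printf 'a c' | od -An -tx1 | tr -d ' \n')

# A non-breaking space (U+00A0) encoded as UTF-8 shows up as the two bytes c2 a0.
nbsp=$(printf 'a\302\240c' | od -An -tx1 | tr -d ' \n')

printf 'plain: %s\nnbsp:  %s\n' "$plain" "$nbsp"
```

If the dump of the real file shows only 20 between the letters, the separator is a plain space and the sort order has another cause.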
EDIT
Just ran it on Linux. I should have looked at the tags. Yes I get the same output you do! My first run was on the Mac. Looks like a difference between GNU and BSD. I will investigate further.
EDIT 2:
Linux uses a field-based sort... still looking for how to suppress it. Tried:
sort -t, txt
hoping to trick GNU into thinking the whole line was one field, but it still used the current locale to sort.
EDIT 3:
The OP solved the problem by setting the locale to C with:
export LC_ALL=C
There seems to be no other approach. The sort command uses the current locale, and although it is often said that C (or its alias POSIX) is the default locale, on Linux a different one has probably been set for you. Enter locale -a to see the available locales. On my system:
$ locale -a
C
POSIX
en_AG
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_NG
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8
It seems that setting the locale to C (or its alias POSIX) is the only way to break the field-based behavior of sort and treat the whole line as one field. IMHO it is rather odd that this is how to do it. I would think the -t or -k options, or perhaps some new option, would be a more sensible way to make this happen.
BTW, it looks like this question has been asked before on SO: unexpected result from gnu sort.
Answer by Karoly Horvath
Weird, works here (cygwin).
Try sort -d txt.
Answer by thiton
As mentioned before, LC_ALL=C sort does the trick. This is simply because different languages have different rules for sorting characters, which are often laid out by senior linguists instead of CS experts. And these rules, in the case of your locale, seem to say that spaces ought to be ignored in sorting.
By prefixing LC_ALL=C (or, when LC_ALL is unset, LC_COLLATE=C suffices), you explicitly declare language-agnostic sorting (and, with LC_ALL, number formatting and the like), which is what you want in this context. If you want to make this your default, export LC_COLLATE=C in your environment.
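The precedence described here can be sketched as follows, assuming a POSIX shell. The helper sample and the variable names are made up for illustration, and en_US.UTF-8 is only a placeholder value that need not be installed:

```shell
# Hypothetical helper that prints the sample lines from the question.
sample() { printf 'ab\na c\na a\n'; }

# LC_ALL overrides every LC_* variable, so the LC_COLLATE value here is ignored.
with_lc_all=$(sample | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort)

# With LC_ALL unset, LC_COLLATE alone decides the collation order.
with_lc_collate=$(unset LC_ALL; sample | LC_COLLATE=C sort)

printf '%s\n---\n%s\n' "$with_lc_all" "$with_lc_collate"
```

Both commands sort by byte value. The second form leaves the other locale categories, such as message language and number formatting, as the user configured them.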
The default is chosen in this way to keep consistency with the "normal", real-world sorting schemes (like the white pages), which often ignore spaces.
Answer by Colin
You could use the 'env' program to temporarily change your LC_COLLATE for the duration of the sort; e.g.
/usr/bin/env LC_COLLATE=POSIX /bin/sort file1 file2
It's a little cumbersome on the command line, but if you're using it in a script it should be transparent.
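In a script, the env call could be wrapped once in a small function, sketched below. The name csort is hypothetical, and sort is looked up on PATH because its absolute location varies between systems. This sketch sets LC_ALL rather than LC_COLLATE, since an LC_ALL value inherited from the caller would otherwise override LC_COLLATE:

```shell
# Hypothetical helper: byte-order sort regardless of the caller's locale.
# All arguments (options and file names) are passed through to sort.
csort() {
    env LC_ALL=C sort "$@"
}

result=$(printf 'ab\na c\na a\n' | csort)
printf '%s\n' "$result"
```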
Answer by mateor
I have been looking at this for a little while, wanting to optimize a shell script I maintain that has a heavy international userbase. (Heavy as in percentage, not quantity.)
Most of the options I saw around the web and on SO seem to recommend what I see here, setting the locale globally (overkill):
export LC_ALL=C
or prefixing it to each individual command, like this example from gnu.org (tedious):
$ echo abcdefghijklmnopqrstuvwxyz | LC_ALL=C /usr/xpg4/bin/tr 'a-z' 'A-Z'
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I wanted to avoid clobbering the user's locale as an unseen side effect of running my program. This turned out to be easily accomplished, just as you would expect, by not setting it globally. There is no need to export this variable past your program.
I had to set LANG instead of LC_ALL for some reason, but all the individual locale categories were set, which is functionally enough for me.
Here is the test, simple as can be:
#!/bin/bash
# locale_checker.sh
# Check the locale, then set LANG=C to optimize character sort and search.
echo "locale was $LANG"
LANG=C
locale
And the output, plus proof that the change is temporary and restricted to my script's process:
mateor@:~/snippets$ ./locale_checker.sh
locale was en_US.UTF-8
LANG=C
LANGUAGE=en_US:en
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
mateor@:~/snippets$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
There you go. You get the optimized locale without clobbering another person's innocent environment, and you avoid the tedium of prefixing it everywhere you think it may help.
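A likely reason LANG worked where LC_ALL did not: a plain assignment only reaches child processes when the variable is already exported, and LANG usually is while LC_ALL is usually unset. The sketch below illustrates this under those assumptions. The variable names are made up, and en_US.UTF-8 stands in for the user's locale:

```shell
LANG=en_US.UTF-8; export LANG   # stand-in for the user's locale (assumption)
unset LC_ALL

# Plain assignment: LC_ALL was not exported, so the child process does not see it.
not_exported=$(LC_ALL=C; sh -c 'printf "%s" "${LC_ALL-unset}"')

# Exported inside the subshell: the child sees it, the parent shell does not.
exported=$(LC_ALL=C; export LC_ALL; sh -c 'printf "%s" "${LC_ALL-unset}"')

# LANG was already exported, so reassigning it is enough.
lang_child=$(LANG=C; sh -c 'printf "%s" "$LANG"')

printf 'LC_ALL, plain assignment: %s\nLC_ALL, exported: %s\n' "$not_exported" "$exported"
printf 'LANG in child: %s\nLANG in parent: %s\n' "$lang_child" "$LANG"
```

Each change is made inside a subshell, so the parent shell's LANG is untouched afterwards, matching the second locale listing above. Note that LANG only takes effect for a category when neither LC_ALL nor that category's own LC_* variable is set.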
Answer by koskenni
Using the C locale, i.e. sorting just by byte values, is not a good solution in languages where some letters are outside the range [A-Za-z]. Such letters are represented as multiple bytes in UTF-8, and the byte-value collating order is then not what one desires. (Some characters may also have two equivalent representations, pre-composed and de-composed.)
Nevertheless, the treatment of spaces is a problem. I tried the following:
$ cat stest
a b
a c
ab
a d
$ sort stest
ab
a b
a c
a d
$ sort -k 1,1 stest
a b
a c
a d
ab
For my needs, -k 1,1 did the trick. Another, clumsier solution I tried was to change spaces to some auxiliary character, then sort, then change the auxiliary characters back into blanks.
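A sketch of why -k 1,1 helps on this data, assuming a POSIX sort. The variable names are made up for illustration:

```shell
# Sample data from the listing above.
input=$(printf 'a b\na c\nab\na d')

# -k 1,1 restricts the sort key to the first blank-separated field, so the
# one-letter key "a" is compared against the two-letter key "ab" directly.
# Lines whose keys are equal are then ordered by comparing the whole line.
by_first_field=$(printf '%s\n' "$input" | sort -k 1,1)
printf '%s\n' "$by_first_field"
```

This keeps the locale's collation rules for the letters themselves. It only separates on the first field, though, so lines that share the first field are still compared as whole lines under the current locale.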