bash: join gives warning "file 1 is not in sorted order"
原文地址: http://stackoverflow.com/questions/26626407/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Join gives warning "file1 is not in sorted order"
Asked by Rudy
I was testing a legacy script in a newer version of bash, 4.1.2(1)-release, and encountered this warning in the console:
join: file 1 is not in sorted order
join: file 2 is not in sorted order
I am quite sure that both of the files are sorted. The files actually merged properly.
Below is the script:
cat $FILE1_PATH'.processed.1' | cut -d'|' -f4,8 | sort | uniq -u > $FILE1_PATH.'processed.2'
cat $FILE2_PATH'.processed.1' | cut -d'|' -f1,8 | sort | uniq -u > $FILE2_PATH.'processed.2'
join -t$'|' -1 1 -2 1 $FILE1_PATH.'processed.2' $FILE2_PATH.'processed.2' > $MERGEFILE_PATH
The job of this script:
- extract field 4 and 8 from file 1
- extract field 1 and 8 from file 2
- combine the extracted fields, using join key file1.field4 = file2.field1
- remove any duplicates.
FILE1.processed.2 :
21VIANET GP INC|GOV
ABN|ABN1
ABN|ABN2
ABOC|ABOC1
ABOC|ABOC1
ABOC|ABOC2
....
FILE2.processed.2 :
ABN|Banks
ABOC|Pharmaceuticals
GOV|Government Agency
....
OUTPUT:
GOV|21VIANET GP INC|Government Agency
ABN|ABN1|Banks
ABN|ABN2|Banks
ABOC|ABOC1|Pharmaceuticals
ABOC|ABOC2|Pharmaceuticals
....
Running the same script in bash version 3.2.25(1)-release gives no warning. Any idea how to resolve the warning?
UPDATE: It seems the warning was caused by these lines in the input files:
ADBC|Banks
ADB|Banks
join expects ADBC to be positioned after ADB, like below:
ADB|Banks
ADBC|Banks
I tried changing the sort command from sort -u to sort -t$'|' -k1 (sort on the first field), but it still does not work.
Answer
The suggestion in the join man page is to use sort -k 1b,1 when you're joining on field 1. (It says "when join has no options" but as far as field selection is concerned, your join is equivalent to no options. -1 1 and -2 1 are the defaults.) You can add -t '|' to that and it will match your join perfectly.
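The suggestion above can be sketched as follows. This is a minimal illustration rather than the asker's actual script: the sample rows and temporary file names are hypothetical, and setting LC_ALL=C is an added assumption so that sort and join agree on plain byte ordering.

```shell
# Hypothetical sample data standing in for the two processed files.
export LC_ALL=C   # assumption: force byte ordering for both sort and join
tmp=$(mktemp -d)
printf '%s\n' 'ADBC|X1' 'ADB|X2' 'ABN|X3' > "$tmp/file1"
printf '%s\n' 'ADBC|Banks' 'ADB|Banks' 'ABN|Banks' > "$tmp/file2"

# Sort both inputs on field 1 only, using the same delimiter join will use.
sort -t '|' -k 1b,1 "$tmp/file1" > "$tmp/file1.sorted"
sort -t '|' -k 1b,1 "$tmp/file2" > "$tmp/file2.sorted"

# Join on field 1 of each file (the default) and keep any warnings for inspection.
joined=$(join -t '|' "$tmp/file1.sorted" "$tmp/file2.sorted" 2> "$tmp/join.err")
warnings=$(cat "$tmp/join.err")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

If the sort key and the join key match, warnings should end up empty and the joined rows come out ordered by the first field.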
-k1 means all fields from 1 to the end. -k1,1 means just field 1. The b is necessary if you have leading whitespace and want to ignore it. sort syntax is weird. And this is after POSIX redesigned it to try to make it sensible. If you ever write a sort command that doesn't look complicated, it's probably not doing what you wanted.
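The difference between -k1 and -k1,1 can be seen on the two lines from the question's update. This is a small sketch that assumes byte ordering (LC_ALL=C); the variable names are illustrative only.

```shell
export LC_ALL=C   # assumption: compare bytes, so '|' (0x7C) sorts after 'C' (0x43)
lines=$(printf '%s\n' 'ADBC|Banks' 'ADB|Banks')

# -k 1: the key runs from field 1 to the end of the line, delimiter included,
# so "ADBC|Banks" is compared with "ADB|Banks" and 'C' beats '|'.
whole=$(printf '%s\n' "$lines" | sort -t '|' -k 1)

# -k 1,1: the key is field 1 only, so "ADB" sorts before "ADBC".
field=$(printf '%s\n' "$lines" | sort -t '|' -k 1,1)

printf 'with -k 1:\n%s\n' "$whole"
printf 'with -k 1,1:\n%s\n' "$field"
```

Only the second ordering is the one join expects when it compares the first field.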
Add --debug to your sort command to see what it's using as a key. With a sample file containing these lines (the first one begins with a space):
 ADBC|Banks
ADB|Banks
ADBC|Banks
you can see the effect of various -k options:
$ sort -s -t '|' -k 1 --debug file
sort: using simple byte comparison
 ADBC|Banks
___________
ADBC|Banks
__________
ADB|Banks
_________
$ sort -s -t '|' -k 1,1 --debug file
sort: using simple byte comparison
 ADBC|Banks
_____
ADB|Banks
___
ADBC|Banks
____
$ sort -s -t '|' -k 1b,1 --debug file
sort: using simple byte comparison
ADB|Banks
___
 ADBC|Banks
 ____
ADBC|Banks
____
Now you're probably wondering about the -s I threw in there. Without it, there is a default last-resort comparison of the whole line as a string, which applies to lines with equal keys. That's not normally a problem and you probably don't need to use -s. It's just that when using --debug, the last-resort comparison clutters the list so I like to use -s to get rid of it.
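The effect of the last-resort comparison can be sketched with two lines that share the same key. The sample data and variable names here are hypothetical, and LC_ALL=C is assumed.

```shell
export LC_ALL=C   # assumption: plain byte comparison
lines=$(printf '%s\n' 'K|b' 'K|a')

# Keys are equal ("K"), so without -s the whole lines are compared as a tiebreak.
default_order=$(printf '%s\n' "$lines" | sort -t '|' -k 1,1)

# With -s the tiebreak is skipped and equal-key lines keep their input order.
stable_order=$(printf '%s\n' "$lines" | sort -s -t '|' -k 1,1)

printf 'default:\n%s\n' "$default_order"
printf 'stable:\n%s\n' "$stable_order"
```

Either ordering is acceptable to join, since it only looks at the key field.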