bash: Linux join utility complains about input file not being sorted

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/25431673/

Time: 2020-09-18 11:11:56  Source: igfitidea

Linux join utility complains about input file not being sorted

Tags: linux, bash, sorting, join, text-processing

Asked by Razvan

I have two files:

file1 has the format:

field1;field2;field3;field4

(file1 is initially unsorted)

file2 has the format:

field1

(file2 is sorted)

I run the 2 following commands:

sort -t\; -k1 file1 -o file1 # to sort file 1
join -t\; -1 1 -2 1 -o 1.1 1.2 1.3 1.4 file1 file2

I get the following message:

join: file1:27497: is not sorted: line_which_was_identified_as_out_of_order

Why is this happening?

(I also tried sorting file1 on the entire line rather than only the first field of the line, but with no success.)

sort -t\; -c file1 doesn't output anything. Around line 27497, the situation is indeed strange, which means that sort doesn't do its job correctly:

              XYZ113017;...
line 27497--> XYZ11301;...
              XYZ11301;...

Answered by mklement0

To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (since I found this post while researching a slightly different problem):

  • When using join, the input files must be sorted by the join field ONLY; otherwise you may see the warning reported by the OP.

There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:

  • If you do specify a field, it's easy to forget that you must also specify a stop field, even if you target only 1 field, because sort uses the remainder of the line if only a start field is specified; e.g.:

    • sort -t, -k1 ... # !! FROM field 1 THROUGH THE REST OF THE LINE
    • sort -t, -k1,1 ... # Field 1 only
  • If your sort field is the FIRST field in the input, it's tempting to not specify any field selector at all.

    • However, if field values can be prefix substrings of each other, sorting whole lines will NOT (necessarily) result in the same sort order as just sorting by the 1st field:
    • sort ... # NOT always the same as 'sort -k1,1'! see below for example
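
The difference between the two key specifications can be checked mechanically with sort -c. The following is only a minimal sketch: the scratch directory, the file names, and the sample data are made up for illustration, and LC_ALL=C is an assumption chosen to make the byte-order comparison reproducible.

```shell
export LC_ALL=C   # assumption: byte-order collation, for reproducible results

tmpdir=$(mktemp -d)                 # hypothetical scratch directory
trap 'rm -rf "$tmpdir"' EXIT

# Made-up sample data; field 1 of one line is a prefix of field 1 of another.
printf '%s\n' 'nameZ^other3' 'nameA^other1' 'nameAA^other2' > "$tmpdir/in.txt"

sort -t'^' -k1   "$tmpdir/in.txt" > "$tmpdir/k1.txt"     # key: field 1 through end of line
sort -t'^' -k1,1 "$tmpdir/in.txt" > "$tmpdir/k1_1.txt"   # key: field 1 only

# sort -c exits non-zero if the file is not ordered by the given key.
if sort -c -t'^' -k1,1 "$tmpdir/k1.txt" 2>/dev/null; then k1_ok=yes; else k1_ok=no; fi
if sort -c -t'^' -k1,1 "$tmpdir/k1_1.txt" 2>/dev/null; then k1_1_ok=yes; else k1_1_ok=no; fi

echo "output of -k1   ordered by field 1: $k1_ok"
echo "output of -k1,1 ordered by field 1: $k1_1_ok"
```

With this data in the C locale, the first check should report "no" and the second "yes", which is the same kind of mismatch that join complains about.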

Pitfall example:

#!/usr/bin/env bash

# Input data: fields separated by '^'.
# Note that, when properly sorting by field 1, the order should
# be "nameA" before "nameAA" (followed by "nameZ").
# Note how "nameA" is a substring of "nameAA".
read -r -d '' input <<EOF
nameA^other1
nameAA^other2
nameZ^other3
EOF

# NOTE: "WRONG" below refers to deviation from the expected outcome
#       of sorting by field 1 only, based on mistaken assumptions.
#       The commands do work correctly in a technical sense.

echo '--- just sort'
sort <<<"$input" | head -1 # WRONG: 'nameAA' comes first

echo '--- sort FROM field 1'
sort -t^ -k1 <<<"$input" | head -1 # WRONG: 'nameAA' comes first

echo '--- sort with field 1 ONLY'
sort -t^ -k1,1 <<<"$input" | head -1 # ok, 'nameA' comes first

Explanation:

  • When NOT limiting sorting to the first field, it is the relative sort order of the characters ^ and A (column index 6) that matters in this example. In other words, the field separator is compared to data, which is the source of the problem: ^ has a HIGHER ASCII value than A and therefore sorts after A, resulting in the line starting with nameAA^ sorting BEFORE the one with nameA^.

  • Note: It is possible for problems to surface on one platform but be masked on another, depending on locale and character-set settings and/or the sort implementation used; e.g., with a locale of en_US.UTF-8 in effect, with , as the separator and - permissible inside fields:

    • sort as used on OSX 10.10.2 (which is an old GNU sort version, 5.93) sorts , before - (in line with ASCII values)
    • sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: sorts - before , [1]

[1] I don't know why - if somebody knows, please tell me. Test with sort <<<$'-\n,'
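
One commonly used way to avoid such platform differences is to run both sort and join under the same locale, for instance the C locale, where comparison is by byte value. The sketch below is an illustration under that assumption, not a statement about what either platform does by default; the scratch directory, file names, and data are hypothetical.

```shell
# Assumption: the C locale is forced, so ',' (0x2C) sorts before '-' (0x2D).
first=$(printf '%s\n' '-' ',' | LC_ALL=C sort | head -n 1)
echo "C locale sorts this first: $first"

tmpdir=$(mktemp -d)                 # hypothetical scratch directory
trap 'rm -rf "$tmpdir"' EXIT

# Made-up data: ',' is the separator and '-' appears inside a field.
printf '%s\n' 'a-b,1' 'a,2' > "$tmpdir/left.txt"
printf '%s\n' 'a-b' 'a'     > "$tmpdir/right.txt"

# Sort both inputs by the join field only, under the same locale that join will use.
LC_ALL=C sort -t, -k1,1 -o "$tmpdir/left.txt"  "$tmpdir/left.txt"
LC_ALL=C sort -t, -k1,1 -o "$tmpdir/right.txt" "$tmpdir/right.txt"

joined=$(LC_ALL=C join -t, "$tmpdir/left.txt" "$tmpdir/right.txt")
printf '%s\n' "$joined"
```

The point of the design is consistency: whichever locale is chosen, sort and join should see the same one, so that join's idea of "sorted" matches what sort produced.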

Answered by mklement0

sort -k1 uses all fields starting from field 1 as the key. You need to specify a stop field.

sort -t\; -k1,1
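
Applied to the setup from the question, the fix might look like the sketch below. The file contents are invented (only the XYZ11301 / XYZ113017 keys echo the question), the scratch directory is hypothetical, and LC_ALL=C is an assumption added so that sort and join agree on the ordering.

```shell
export LC_ALL=C   # assumption: same byte-order collation for sort and join

tmpdir=$(mktemp -d)                 # hypothetical scratch directory
trap 'rm -rf "$tmpdir"' EXIT

# Invented sample data in the question's format: field1;field2;field3;field4
printf '%s\n' 'XYZ113017;a;b;c' 'XYZ11301;d;e;f' 'ABC1;g;h;i' > "$tmpdir/file1"
printf '%s\n' 'XYZ11301' 'XYZ113017' > "$tmpdir/file2"   # already sorted

# Sort file1 in place by field 1 ONLY (note the stop field in -k1,1).
sort -t';' -k1,1 -o "$tmpdir/file1" "$tmpdir/file1"

result=$(join -t';' -1 1 -2 1 -o 1.1,1.2,1.3,1.4 "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$result"
```

With -k1 instead of -k1,1, the line starting with XYZ113017; would sort before the one starting with XYZ11301; in this locale (because 7 has a lower byte value than ;), which is the ordering shown in the question.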

Answered by skullnobrains

... or the GNU sort is just as buggy as every other GNU command

Try to sort Gi1/0/11 vs Gi1/0/1 and you'll never be able to get an actual regular textual sort suitable for join input, because someone added some extra intelligence to sort, which will happily use numeric or human-numeric sorting automagically in such cases, without even bothering to add a flag to force the regular behavior.

What is suitable for humans is seldom suitable for scripting.