bash - Joining multiple fields in text files on Unix

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow CC BY-SA and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/2619562/


Joining multiple fields in text files on Unix

Tags: linux, bash, unix, join

Asked by neversaint

How can I do it?

File1 looks like this:

foo 1 scaf 3 
bar 2 scaf 3.3

File2 looks like this:

foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00

What I want to do is to find the lines that co-occur in File1 and File2 when fields 1, 2, and 3 are the same.

Is there a way to do it?

Accepted answer by ghostdog74

You can try this:

awk '{
 # remember the three key fields
 o1=$1;o2=$2;o3=$3
 # blank out fields 1-3, so only the value field remains in $0
 $1=$2=$3="";gsub(" +","")
 # append this file's value to whatever is already stored under the key
 _[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
} END{ for(i in _) print i,_[i] }' file1 file2

Output:

$ ./shell.sh
foo 1 scaf  3 4.5
bar 2 scaf  3.3 1.00
foo 1 boo  2.3
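The $ ./shell.sh in this transcript suggests the one-liner was saved in a small wrapper script. A minimal sketch of such a wrapper (the script name and the argument handling are my assumptions, not part of the original answer):

#!/bin/bash
# shell.sh - merge lines of two files that share fields 1-3 (hypothetical wrapper)
awk '{
 o1=$1;o2=$2;o3=$3
 $1=$2=$3="";gsub(" +","")
 _[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
} END{ for(i in _) print i,_[i] }' "${1:-file1}" "${2:-file2}"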

If you want to omit uncommon lines:

awk 'FNR==NR{
 # first pass (file2): save fields 4..NF under the key of fields 1-3
 s=""
 for(i=4;i<=NF;i++){ s=s FS $i }
 _[$1$2$3] = s
 next
}
{
  # second pass (file1): reprint the line, then append file2's saved values
  printf $1 FS $2 FS $3 FS
  for(o=4;o<NF;o++){
   printf $o" "
  }
  printf $NF FS _[$1$2$3]"\n"
 } ' file2 file1

Output:

$ ./shell.sh
foo 1 scaf 3  4.5
bar 2 scaf 3.3  1.00


Answered by thedk

Here is the correct answer (in terms of using standard GNU coreutils tools, and not writing a custom script in perl/awk/you name it).

$ cat file1
foo 1 scaf 3 
bar 2 scaf 3.3
$ cat file2
foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00
$ join -j1 -o1.2,1.3,1.4,1.5,2.5 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5
$

OK, how does it work:

  1. First of all we will use a great tool, join, which can merge two lines. join has two requirements:

    • We can join only by a single field.
    • Both files must be sorted by the key column!
  2. We need to generate keys in the input files, and for that we use a simple awk script:

    $ cat file1
    foo 1 scaf 3 
    bar 2 scaf 3.3
    $ <file1 awk '{print $1"-"$2"-"$3" "$0}'
    foo-1-scaf foo 1 scaf 3 
    bar-2-scaf bar 2 scaf 3.3
    $
    

    You see, we added a 1st column with a key like "foo-1-scaf". We do the same with file2. BTW, <file awk is just a fancy way of writing awk file, or cat file | awk.

    We also should sort our files by the key; in our case this is column 1, so we add | sort -k1,1 to the end of the command (sort by text from column 1 to column 1).

  3. At this point we could just generate the files file1.with.key and file2.with.key and join them, but suppose those files are huge; we don't want to copy them over the filesystem. Instead we can use bash process substitution to generate the output into a named pipe (this avoids any unnecessary intermediate file creation). For more info please read the provided link.

    Our target syntax is: join <( some command ) <(some other command)

  4. The last thing is to explain the fancy join arguments: -j1 -o1.2,1.3,1.4,1.5,2.5

    • -j1 - join by the key in the 1st column (in both files)
    • -o - output only the listed fields: 1.2 (1st file, field 2), 1.3 (1st file, field 3), etc.

      This way we joined the lines, but join outputs only the necessary columns.

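A practical footnote (my addition, not from the original answer): join and sort must agree on collation order, or join may complain that its input is not sorted. Pinning both tools to the C locale is a common safeguard; the command is otherwise identical to the one above:

LC_ALL=C join -j1 -o1.2,1.3,1.4,1.5,2.5 \
    <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | LC_ALL=C sort -k1,1) \
    <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | LC_ALL=C sort -k1,1)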

The lessons learned from this post should be:

  • you should master the coreutils package; those tools are very powerful when combined, and you almost never need to write a custom program to deal with such cases,
  • coreutils tools are also blazing fast and heavily tested, so they are always the best choice.

Answered by Jonathan Leffler

The join command is hard to use and only joins on one column

Extensive experimentation plus close scrutiny of the manual pages indicates that you cannot directly join multiple columns - and all my working examples of join, funnily enough, use just one joining column.

Consequently, any solution will require the columns-to-be-joined to be concatenated into one column, somehow. The standard join command also requires its inputs to be in the correct sorted order - there's a remark in the GNU join (info coreutils join) about it not always requiring sorted data:

However, as a GNU extension, if the input has no unpairable lines the sort order can be any order that considers two fields to be equal if and only if the sort comparison described above considers them to be equal.

One possible way to do it with the given files is:

awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file1 |
sort > sort1
awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file2 |
sort > sort2
join -1 1 -2 1 -o 1.2,1.3,1.4,1.5,2.5 sort1 sort2

This creates a composite sort field at the start, using ':' to separate the sub-fields, and then sorts the file - for each of the two files. The join command then joins on the composite field, but prints out only the non-composite (non-join) fields.
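For concreteness (this intermediate listing is my illustration, not part of the original answer), the sort1 file built from file1 would contain:

bar:2:scaf bar 2 scaf 3.3
foo:1:scaf foo 1 scaf 3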

The output is:

输出是:

cat file1
    | sed
        -e 's/ [^ ]*$/ "/'
        -e 's/ /  */g'
        -e 's/^/grep "^/'
        -e 's/$/ file2 | awk "{print \\" \"\\" \"\}"/'
    >xx99
bash xx99
rm xx99

Failed attempts to make join do what it won't do

join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2

On MacOS X 10.6.3, this gives:

$ join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2
foo 1 scaf 3 4.5 
bar 2 scaf 3.3 4.5 
$

This is joining on field 3 (only) - the repeated -1 and -2 options override one another, so only the last pair takes effect - which is not what is wanted.


You do need to ensure that the input files are in the correct sorted order.

Answered by Michael Mrozek

It's probably easiest to combine the first three fields with awk:

awk '{print $1 "_" $2 "_" $3 " " $4}' filename

Then you can use join normally on "field 1" - a complete pipeline is sketched below.
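Putting the two steps together (this end-to-end sketch is mine; the .keyed file names are illustrative, not from the answer):

# build a single join key from fields 1-3, then sort, as join requires
awk '{print $1 "_" $2 "_" $3 " " $4}' file1 | sort -k1,1 > file1.keyed
awk '{print $1 "_" $2 "_" $3 " " $4}' file2 | sort -k1,1 > file2.keyed
# join on the combined key, then split it back into separate fields
# (assumes the data itself contains no underscores)
join -j1 file1.keyed file2.keyed | tr '_' ' '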

Answered by paxdiablo

How about:

cat file1 file2 |
    awk '{print $1" "$2" "$3}' |
    sort |
    uniq -c |
    grep -v '^ *1 ' |
    awk '{print $2" "$3" "$4}'

This is assuming you're not too worried about the white space between fields (in other words, three tabs and a space is no different to a space and 7 tabs). This is usually the case when you're talking about fields within a text file.

What it does is output both files, stripping off the last field (since you don't care about that one in terms of comparisons). It then sorts that so that similar lines are adjacent, then uniquifies them (replaces each group of adjacent identical lines with one copy and a count).

It then gets rid of all those that had a one-count (no duplicates) and prints out each with the count stripped off. That gives you your "keys" to the duplicate lines, and you can then use another awk iteration to locate those keys in the files if you wish - one possible sketch follows.
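One possible shape for that second pass (this sketch is mine - the keys file and the two-file awk idiom are not in the original answer):

# save the duplicated keys produced by the pipeline above
cat file1 file2 |
    awk '{print $1" "$2" "$3}' |
    sort | uniq -c | grep -v '^ *1 ' |
    awk '{print $2" "$3" "$4}' > keys
# print every line of either file whose first three fields match a saved key
awk 'FNR==NR { k[$1" "$2" "$3]=1; next } ($1" "$2" "$3) in k' keys file1 file2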

This won't work as expected if two identical keys are only in one file, since the files are combined early on. In other words, if you have duplicate keys in file1 but not in file2, that will be a false positive.

Then, the only real solution I can think of is one which checks file2 for each line in file1, although I'm sure others may come up with cleverer solutions.



And, for those who enjoy a little bit of sado-masochism, here's the afore-mentioned not-overly-efficient solution:

# build a throwaway script: one "grep | awk" line per line of file1
cat file1 \
    | sed \
        -e 's/ [^ ]*$/ "/' \
        -e 's/ /  */g' \
        -e 's/^/grep "^/' \
        -e 's/$/ file2 | awk "{print \\$1, \\$2, \\$3, \\$4}"/' \
    >xx99
bash xx99
rm xx99

This one constructs a separate script file to do the work. For each line in file1, it creates a line in the script to look for that in file2. If you want to see how it works, just have a look at xx99 before you delete it.

And, in this one, the spaces do matter, so don't be surprised if it doesn't work for lines where the spacing differs between file1 and file2 (though, as with most "hideous" scripts, that can be fixed with just another link in the pipeline). It's more here as an example of the ghastly things you can create for quick'n'dirty jobs.

This is not what I would do for production-quality code, but it's fine for a once-off, provided you destroy all evidence of it before The Daily WTF finds out about it :-)

Answered by paxdiablo

Here is a way to do it in Perl:

#!/usr/local/bin/perl
use warnings;
use strict;

# read file1, remembering the line number of each (field1, field2, field3) key
open my $file1, "<", "file1" or die $!;
my %file1keys;
while (<$file1>) {
    my @keys = split /\s+/, $_;
    next unless @keys;
    $file1keys{$keys[0]}{$keys[1]}{$keys[2]} = [$., $_];
}
close $file1 or die $!;

# scan file2, reporting every line whose first three fields were seen in file1
open my $file2, "<", "file2" or die $!;
while (<$file2>) {
    my @keys = split /\s+/, $_;
    next unless @keys;
    if (my $found = $file1keys{$keys[0]}{$keys[1]}{$keys[2]}) {
        print "Keys occur at file1:$found->[0] and file2:$..\n";
    }
}
close $file2 or die $!;
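For the sample files, running it would report (my illustration; the script name cooccur.pl is an assumption):

$ perl cooccur.pl
Keys occur at file1:1 and file2:1.
Keys occur at file1:2 and file2:3.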

Answered by agc

Simple method (no awk, join, sed, or perl), using software tools cut, grep, and sort:

cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g

Output (does not print unmatched lines):

bar 2 scaf 1.00
bar 2 scaf 3.3
foo 1 scaf 3 
foo 1 scaf 4.5

How it works...

  1. cut makes a list of all the lines to search for.
  2. grep's -f - switch reads the patterns from cut and searches File1 and File2 for them.
  3. sort isn't necessary, but makes the data easier to read.
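One caveat worth adding (mine, not part of the original answer): grep treats each line from cut as an unanchored regular expression, so a dot in a key matches any character, and a short key can match in the middle of a longer line. Escaping the metacharacters and anchoring each pattern makes the match stricter:

# escape regex metacharacters in the keys, then anchor each pattern at line start
cut -d ' ' -f1-3 File1 \
    | sed -e 's/[][\.*^$]/\\&/g' -e 's/^/^/' \
    | grep -h -f - File1 File2 \
    | sort -t ' ' -k 1,2g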


Condensed results with datamash:

cut -d ' ' -f1-3 File1 | grep -h -f - File1 File2 | \
datamash -t ' ' -s -g1,2,3 collapse 4

Output:

bar 2 scaf 3.3,1.00
foo 1 scaf 3,4.5

If File1 is huge and somewhat redundant, adding sort -u should speed things up:

cut -d ' ' -f1-3 File1 | sort -u | grep -h -f - File1 File2 | sort -t ' ' -k 1,2g

Answered by Tyler McHenry

A professor I used to work with created a set of perl scripts that can perform a lot of database-like operations on column-oriented flat text files. It's called Fsdb. It can definitely do this, and it's especially worth looking into if this isn't just a one-off need (so you're not constantly writing custom scripts).
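To give a flavor of what that can look like (this sketch is mine, written from memory of Fsdb's conventions; the header syntax and the dbjoin invocation are assumptions that should be checked against the Fsdb documentation):

# Fsdb files are plain text with a leading header naming the columns (names assumed)
(echo '#fsdb -F s name num tag val'; cat file1) > file1.fsdb
(echo '#fsdb -F s name num tag val'; cat file2) > file2.fsdb
# dbjoin is documented to join on one or more shared columns
dbjoin file1.fsdb file2.fsdb name num tag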

Answered by LukStorms

A solution similar to the one Jonathan Leffler offered.

Create 2 temporary sorted files with a different delimiter, where the matching columns are combined into the first field. Then join the temp files on the first field, and output the second field.

$ cat file1.txt | awk -F" " '{print $1"-"$2"-"$3";"$0}' | sort > file1.tmp
$ cat file2.txt | awk -F" " '{print $1"-"$2"-"$3";"$0}' | sort > file2.tmp
$ join -t';' -o 1.2 file1.tmp file2.tmp > file1.same.txt
$ join -t';' -o 2.2 file1.tmp file2.tmp > file2.same.txt
$ rm -f file1.tmp file2.tmp
$ cat file1.same.txt
bar 2 scaf 3.3
foo 1 scaf 3
$ cat file2.same.txt
bar 2 scaf 1.00
foo 1 scaf 4.5
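To make the intermediate format concrete (my illustration, not part of the original answer), file1.tmp would contain:

bar-2-scaf;bar 2 scaf 3.3
foo-1-scaf;foo 1 scaf 3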

Answered by agc

Using datamash's collapse operation, plus a bit of cosmetic sort-ing and tr-ing:

cat File* | datamash -t ' ' -s -g1,2,3 collapse 4 | sort -g -k2 | tr ',' ' '

Output (common lines have a 5th field, uncommon lines do not):

foo 1 boo 2.3
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
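If only the common lines are wanted (my addition, not agc's), one more filter keeps just the rows that gained a 5th field:

cat File* | datamash -t ' ' -s -g1,2,3 collapse 4 | tr ',' ' ' | awk 'NF>=5'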