bash: join two files based on two columns

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/7392204/

Time: 2020-09-18 00:46:05  Source: igfitidea

join two files based on two columns

bash awk

Asked by Nick

Believe it or not, I've searched all over the internet and haven't found a working solution for this problem in AWK.

I have two files, A and B:

File A:

chr1   pos1   
chr1   pos2
chr2   pos1
chr2   pos2

File B:

chr1 pos1
chr2 pos1
chr3 pos2

Desired Output:

chr1 pos1
chr2 pos1

I'd like to join these two files to basically get the intersection between the two files based on the first AND second columns, not just the first. Since this is the case, most simple scripts won't work and join doesn't seem to be an option.

Any ideas?

EDIT: sorry, I didn't mention that there are more columns than just the two I showed. I've only shown two in my example because I'm only interested in the first two columns between both files being identical, the rest of the data aren't important (but are nonetheless in the file)

Accepted answer by Aif

Hmm, my idea is the following: use join to merge the two files, then filter the result with awk.

$ join  A B 
chr1 pos1 pos1
chr1 pos2 pos1
chr2 pos1 pos1
chr2 pos2 pos1

$ join  A B | awk '{ if ($2==$3) printf("%s %s\n", $1, $2) }'
chr1 pos1
chr2 pos1

Edit: given the edit, the join solution may still work (with options), so the concept remains correct (imo).
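
One way the "with options" remark could play out, as a rough sketch rather than Aif's actual command: join's -o option selects explicit output fields, so the awk comparison still sees exactly three columns even when the files carry extra data. The extra columns (a1, b1, x, ...) and the temporary directory are made up for illustration, and join expects both inputs to be sorted on the join field.

```shell
# Sample inputs with extra columns (hypothetical data for illustration).
dir=$(mktemp -d)
printf '%s\n' 'chr1 pos1 a1' 'chr1 pos2 a2' 'chr2 pos1 a3' 'chr2 pos2 a4' > "$dir/A"
printf '%s\n' 'chr1 pos1 b1 x' 'chr2 pos1 b2 y' 'chr3 pos2 b3 z' > "$dir/B"

# -o lists the output fields explicitly:
#   1.1 = join key, 1.2 = column 2 of A, 2.2 = column 2 of B.
# awk then keeps only the rows where both second columns agree.
result=$(join -o 1.1,1.2,2.2 "$dir/A" "$dir/B" | awk '$2 == $3 { print $1, $2 }')
printf '%s\n' "$result"
rm -r "$dir"
```

Because join pairs every line of A with every line of B that shares column 1, the intermediate output can grow large when a key such as chr1 repeats many times in both files.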

Answer by glenn jackman

The awk solution is:

awk 'FILENAME==ARGV[1] {pair[$1 " " $2]; next} ($1 " " $2 in pair)' fileB fileA

Place the smaller file first since you have to basically hold it in memory.
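
A small worked run of this approach, as a sketch: the third columns (a1, b1, ...) are invented stand-ins for the extra data mentioned in the question's edit, and the files live in a throwaway temporary directory.

```shell
# Sample inputs with an extra third column (hypothetical data).
dir=$(mktemp -d)
printf '%s\n' 'chr1 pos1 a1' 'chr1 pos2 a2' 'chr2 pos1 a3' 'chr2 pos2 a4' > "$dir/fileA"
printf '%s\n' 'chr1 pos1 b1' 'chr2 pos1 b2' 'chr3 pos2 b3' > "$dir/fileB"

# First file (fileB): remember each "col1 col2" pair as an array key.
# Second file (fileA): print the whole line if its pair was remembered.
result=$(awk 'FILENAME == ARGV[1] { pair[$1 " " $2]; next }
              ($1 " " $2 in pair)' "$dir/fileB" "$dir/fileA")
printf '%s\n' "$result"
rm -r "$dir"
```

Only the keys of the first file are held in memory, and the printed lines (with their extra columns) come from the second file.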

Answer by Dimitre Radoulov

I would write it like this:

awk 'NR == FNR {
  k[$1, $2]
  next
  }
($1, $2) in k
  ' filea fileb

The order of the input files might need to be adapted based on the exact requirement.
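
To make the point about file order concrete, here is a small sketch with made-up marker columns (fromA, fromB): the keys are collected from the first file, and the matching lines are printed from the second, so swapping the arguments changes whose extra columns appear in the output.

```shell
# Two tiny inputs whose third column says which file a line came from
# (hypothetical data for illustration).
dir=$(mktemp -d)
printf '%s\n' 'chr1 pos1 fromA' 'chr1 pos2 fromA' > "$dir/filea"
printf '%s\n' 'chr1 pos1 fromB' 'chr3 pos2 fromB' > "$dir/fileb"

# NR == FNR holds only while the first file is being read: store its keys.
# For the second file, print lines whose (col1, col2) key was stored.
prog='NR == FNR { k[$1, $2]; next } ($1, $2) in k'

from_b=$(awk "$prog" "$dir/filea" "$dir/fileb")   # lines printed from fileb
from_a=$(awk "$prog" "$dir/fileb" "$dir/filea")   # lines printed from filea
printf '%s\n' "$from_b" "$from_a"
rm -r "$dir"
```

One caveat with the NR == FNR idiom: if the first file is empty, the condition stays true while the second file is read, so nothing is printed.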

Answer by anubhava

Why not a simple grep -f, like this:

grep -f fileB fileA

EDIT:

For files having more than 2 columns try this:

grep "$(cut -d" " -f1,2 fileB)" fileA | cut -d" " -f1,2
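
One thing to watch with the grep approach: each pattern is matched as a regular expression anywhere in the line, so "chr1 pos1" would also match a line starting with "chr1 pos10". A possible tightening, offered as a sketch and not as part of anubhava's answer, anchors each pair to the start of the line and requires a space or end of line after it. It assumes single-space separators (as the cut commands above do) and keys without regex metacharacters; the sample rows and the patterns file are made up for illustration.

```shell
# Sample inputs; note the "pos10" row that a plain substring match would catch.
dir=$(mktemp -d)
printf '%s\n' 'chr1 pos1 a1' 'chr1 pos10 a2' 'chr2 pos1 a3' > "$dir/fileA"
printf '%s\n' 'chr1 pos1 b1' 'chr2 pos1 b2' 'chr3 pos2 b3' > "$dir/fileB"

# Turn each "col1 col2" pair from fileB into an anchored pattern such as
#   ^chr1 pos1( |$)
# Assumption: the keys contain no regex metacharacters.
cut -d' ' -f1,2 "$dir/fileB" | sed 's/.*/^&( |$)/' > "$dir/patterns"

# -E enables the ( | ) alternation; -f reads one pattern per line.
result=$(grep -E -f "$dir/patterns" "$dir/fileA" | cut -d' ' -f1,2)
printf '%s\n' "$result"
rm -r "$dir"
```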