bash: join two files based on two columns
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/7392204/
join two files based on two columns
Asked by Nick
Believe it or not, I've searched all over the internet and haven't found a working solution for this problem in AWK.
I have two files, A and B:
File A:
chr1 pos1
chr1 pos2
chr2 pos1
chr2 pos2
File B:
chr1 pos1
chr2 pos1
chr3 pos2
Desired Output:
chr1 pos1
chr2 pos1
I'd like to join these two files to basically get the intersection between the two files based on the first AND second columns, not just the first. Since this is the case, most simple scripts won't work and join doesn't seem to be an option.
Any ideas?
EDIT: Sorry, I didn't mention that the files have more columns than the two I showed. I only showed two in my example because I only care that the first two columns are identical in both files; the rest of the data isn't important (but is still in the files).
Accepted answer by Aif
Hmm, my idea is the following: use join to merge the two files, then correct the result with awk.
$ join A B
chr1 pos1 pos1
chr1 pos2 pos1
chr2 pos1 pos1
chr2 pos2 pos1
$ join A B | awk '{ if ($2==$3) printf("%s %s\n", $1, $2) }'
chr1 pos1
chr2 pos1
Edit: given the edit to the question, the join solution may still work (with options), so the concept remains correct (imo).
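The answer does not spell out which options it means. One possible way to keep using join on data with extra columns, offered here as an assumption and not necessarily what the author intended, is to glue the first two columns into a single key before joining. A minimal sketch; the sample rows, the extra columns (a1, b1, ...) and the ':' separator are made up for illustration, and it assumes ':' never appears in the first two columns:

```shell
# Hypothetical sample data; the third column is invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 pos1 a1' 'chr1 pos2 a2' 'chr2 pos1 a3' 'chr2 pos2 a4' > "$tmp/A"
printf '%s\n' 'chr1 pos1 b1' 'chr2 pos1 b2' 'chr3 pos2 b3' > "$tmp/B"

# Prefix each line of A with a composite key "col1:col2" and sort on that key.
awk '{ print $1 ":" $2, $0 }' "$tmp/A" | LC_ALL=C sort -k1,1 > "$tmp/A.keyed"

# Reduce B to its sorted, unique composite keys.
awk '{ print $1 ":" $2 }' "$tmp/B" | LC_ALL=C sort -u > "$tmp/B.keys"

# Join on the composite key, then drop the helper key column again.
result=$(LC_ALL=C join "$tmp/A.keyed" "$tmp/B.keys" | cut -d" " -f2-)
printf '%s\n' "$result"

rm -rf "$tmp"
```

Both inputs are sorted under the same locale (LC_ALL=C) because join expects its inputs ordered on the join field.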
Answered by glenn jackman
The awk solution is:
awk 'FILENAME==ARGV[1] {pair[$1 " " $2]; next} ($1 " " $2 in pair)' fileB fileA
Place the smaller file first since you have to basically hold it in memory.
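To try this approach on files with more than two columns, a minimal sketch follows. The sample rows and the extra third column are made up for illustration; only the first two fields take part in the match.

```shell
# Hypothetical sample data; the third column is invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 pos1 a1' 'chr1 pos2 a2' 'chr2 pos1 a3' 'chr2 pos2 a4' > "$tmp/fileA"
printf '%s\n' 'chr1 pos1 b1' 'chr2 pos1 b2' 'chr3 pos2 b3' > "$tmp/fileB"

# While reading the first file (fileB), store "col1 col2" as an array key.
# For the second file (fileA), print lines whose "col1 col2" is a stored key.
result=$(awk 'FILENAME==ARGV[1] {pair[$1 " " $2]; next} ($1 " " $2 in pair)' "$tmp/fileB" "$tmp/fileA")
printf '%s\n' "$result"

rm -rf "$tmp"
```

The printed lines come from fileA in full, extra columns included; only the keys of fileB are held in memory.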
Answered by Dimitre Radoulov
I would write it like this:
awk 'NR == FNR {
  k[$1, $2]
  next
  }
($1, $2) in k
' filea fileb
The order of the input files might need to be adapted based on the exact requirement.
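A small sketch of why the file order matters, assuming both files carry an extra third column (the sample rows are made up for illustration): the file named first only supplies the keys, while the printed lines come from the file named second.

```shell
# Hypothetical sample data; the third column is invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 pos1 a1' 'chr1 pos2 a2' 'chr2 pos1 a3' 'chr2 pos2 a4' > "$tmp/filea"
printf '%s\n' 'chr1 pos1 b1' 'chr2 pos1 b2' 'chr3 pos2 b3' > "$tmp/fileb"

# NR == FNR holds only while the first file is being read.
# Keys come from filea, matching lines are printed from fileb.
from_b=$(awk 'NR == FNR { k[$1, $2]; next } ($1, $2) in k' "$tmp/filea" "$tmp/fileb")

# Swapped order: keys come from fileb, matching lines are printed from filea.
from_a=$(awk 'NR == FNR { k[$1, $2]; next } ($1, $2) in k' "$tmp/fileb" "$tmp/filea")

printf 'lines from fileb:\n%s\n' "$from_b"
printf 'lines from filea:\n%s\n' "$from_a"

rm -rf "$tmp"
```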
Answered by anubhava
Why not a simple grep -f, like this:
grep -f fileB fileA
EDIT:
For files having more than 2 columns try this:
grep "$(cut -d" " -f1,2 fileB)" fileA | cut -d" " -f1,2
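One thing worth checking with the grep approach: the patterns are unanchored regular expressions, so a key such as "chr1 pos1" also matches a line that starts with "chr1 pos10". A minimal sketch of anchoring each key, assuming space-separated fields and keys that contain no regex metacharacters; the sample rows and the patterns file name are made up for illustration:

```shell
# Hypothetical sample data; note the "pos10" row in fileA.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 pos1 a1' 'chr1 pos10 a2' 'chr2 pos1 a3' > "$tmp/fileA"
printf '%s\n' 'chr1 pos1 b1' 'chr2 pos1 b2' 'chr3 pos2 b3' > "$tmp/fileB"

# Unanchored patterns, as in the command above: "chr1 pos10" slips through.
loose=$(grep "$(cut -d" " -f1,2 "$tmp/fileB")" "$tmp/fileA" | cut -d" " -f1,2)

# Anchored patterns: each key must start the line and be followed
# by a space or the end of the line.
cut -d" " -f1,2 "$tmp/fileB" | sed 's/^/^/; s/$/( |$)/' > "$tmp/patterns"
strict=$(grep -E -f "$tmp/patterns" "$tmp/fileA" | cut -d" " -f1,2)

printf 'unanchored:\n%s\n' "$loose"
printf 'anchored:\n%s\n' "$strict"

rm -rf "$tmp"
```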

