match values in first column of two files and join the matching lines in a new file (bash)
Notice: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors on Stack Overflow.
Original question: http://stackoverflow.com/questions/14195954/
Asked by user1911823
I need to match the string in column 1 ($1) of file1.txt against the string in column 1 ($1) of file2.txt. Then I want to join the lines where there was a match and write them to a new file.
cat file1.txt
1050008 5.156725968 8.404038296 124.9198605 3.23E-21 2.33E-17 38.57865782
3310747 5.631470026 8.581936875 124.6039122 3.34E-21 2.33E-17 38.55204806
5910451 4.900364671 8.455329195 124.5720603 3.35E-21 2.33E-17 38.54935989
730156 5.565210738 8.48792701 122.2168789 4.28E-21 2.33E-17 38.34773989
cat file2.txt
4230037 ILMN Controls ILMN_Controls ERCC-00071 ILMN_333646 ERCC-00071 ERCC-00071
1050008 ILMN Controls ILMN_Controls ERCC-00009 ILMN_333584 ERCC-00009 ERCC-00009
5260356 ILMN Controls ILMN_Controls ERCC-00053 ILMN_333628 ERCC-00053 ERCC-00053
3310747 ILMN Controls ILMN_Controls ERCC-00144 ILMN_333719 ERCC-00144 ERCC-00144
5910451 ILMN Controls ILMN_Controls ERCC-00003 ILMN_333578 ERCC-00003 ERCC-00003
1710435 ILMN Controls ILMN_Controls ERCC-00138 ILMN_333713 ERCC-00138 ERCC-00138
1400612 ILMN Controls ILMN_Controls ERCC-00084 ILMN_333659 ERCC-00084 ERCC-00084
730156 ILMN Controls ILMN_Controls ERCC-00017 ILMN_333592 ERCC-00017 ERCC-00017
I would like the output file to look like this:
out.txt
1050008 5.156725968 8.404038296 124.9198605 3.23E-21 2.33E-17 38.57865782 1050008 ILMN Controls ILMN_Controls ERCC-00009 ILMN_333584 ERCC-00009 ERCC-00009
3310747 5.631470026 8.581936875 124.6039122 3.34E-21 2.33E-17 38.55204806 3310747 ILMN Controls ILMN_Controls ERCC-00144 ILMN_333719 ERCC-00144 ERCC-00144
5910451 4.900364671 8.455329195 124.5720603 3.35E-21 2.33E-17 38.54935989 5910451 ILMN Controls ILMN_Controls ERCC-00003 ILMN_333578 ERCC-00003 ERCC-00003
730156 5.565210738 8.48792701 122.2168789 4.28E-21 2.33E-17 38.34773989 730156 ILMN Controls ILMN_Controls ERCC-00017 ILMN_333592 ERCC-00017 ERCC-00017
The files are tab delimited and have missing values in some columns.
There are 31 columns and more than 47,000 lines in file2.txt, and I'm trying to do this in bash (OS X).
If you have a solution, I would greatly appreciate it if you could briefly explain the steps, as I'm very new to this.
Answered by Dimitre Radoulov
awk 'BEGIN {
  FS = OFS = "\t"
}
NR == FNR {
  # while reading the 1st file
  # store its records in the array f
  f[$1] = $0
  next
}
$1 in f {
  # when match is found
  # print all values
  print f[$1], $0
}' file1.txt file2.txt > out.txt
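Since the question asks for a brief explanation of the steps, here is a minimal sketch that runs the awk approach from this answer on a tiny made-up sample. awk reads file1.txt first; `NR == FNR` is true only while that first file is being read, so each of its lines is stored in the array `f` under its first field (`$1`). For every line of file2.txt, `$1 in f` checks whether that key was stored, and if so the stored line and the current line (`$0`) are printed with a tab between them. For comparison, the sketch also shows a `join`-based alternative, which is not part of this answer's awk program; `join` needs both inputs sorted on the key and prints the key only once. The `join_demo` directory, the sample rows, and the output file names are assumptions made up for illustration.

```shell
#!/bin/sh
# Illustrative sketch only: join_demo, the sample rows and the output file
# names are hypothetical and do not come from the original post.
set -eu
demo=join_demo
mkdir -p "$demo"
tab=$(printf '\t')

# Two small tab-delimited files. The second field of the 730156 row in
# file1.txt is left empty to mimic a missing value.
printf '1050008\t5.15\t8.40\n730156\t\t8.48\n9999999\t1.00\t2.00\n' > "$demo/file1.txt"
printf '4230037\tILMN\tERCC-00071\n1050008\tILMN\tERCC-00009\n730156\tILMN\tERCC-00017\n' > "$demo/file2.txt"

# awk approach: store each line of file1.txt under its first field, then
# print the stored line followed by the current line for matching keys.
awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { f[$1] = $0; next }   # true only while reading the first file
$1 in f   { print f[$1], $0 }    # key from file2.txt also seen in file1.txt
' "$demo/file1.txt" "$demo/file2.txt" > "$demo/out_awk.txt"

# join approach: sort both files on field 1 first. LC_ALL=C makes sort and
# join agree on the ordering. The key appears once per output line.
LC_ALL=C sort -t "$tab" -k1,1 "$demo/file1.txt" > "$demo/file1.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$demo/file2.txt" > "$demo/file2.sorted"
LC_ALL=C join -t "$tab" "$demo/file1.sorted" "$demo/file2.sorted" > "$demo/out_join.txt"

echo "awk result:"
cat "$demo/out_awk.txt"
echo "join result:"
cat "$demo/out_join.txt"
```

Note that the awk output follows the line order of file2.txt (the second file read), while the `join` output follows the sorted key order. The awk version holds only file1.txt in memory, so the 47,000-line file2.txt is streamed one line at a time.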

