Bash:从多个文件中获取交集
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/19214179/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Bash. Get intersection from multiple files
提问by Jonovono
So let me explain this a bit more:
所以让我再解释一下:
I have a directory called tags that has a file for each tag, something like:
我有一个名为 tags 的目录,其中每个标签都有一个文件,例如:
tags/
t1
t2
t3
In each of the tag files is a structure like:
在每个标签文件中都有一个结构,如:
<inode> <filename> <filepath>
Of course, each tag file will have a list of many files with that tag (but a file can appear only once within a given tag file). And a file may be in multiple tag files.
当然,每个标签文件都会列出许多带有该标签的文件(但一个文件在同一个标签文件中只能出现一次),并且一个文件可能出现在多个标签文件中。
What I want to be able to do is call a command like
我想要做的是调用一个命令
tags <t1> <t2>
and have it list the files that have BOTH the tags t1 and t2 in a nice way.
并让它以一种很好的方式列出同时具有标签 t1 和 t2 的文件。
My plan right now was to make a temp file. Basically output the entire file of t1 into it. Then run through each line in t2 and do an awk on the file. And just keep doing that.
我现在的计划是制作一个临时文件。基本上将t1的整个文件输出到其中。然后遍历 t2 中的每一行并对文件执行 awk。继续这样做。
But I am wondering if anyone has any other ways. I am not overly familiar with awk, grep etc.
但我想知道是否有人有其他方法。我对 awk、grep 等不太熟悉。
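The temp-file plan described above can also be collapsed into a single awk call using the classic two-file idiom; a minimal sketch for the two-tag case, comparing whole lines (inode, filename and path) and assuming the tags/ paths shown above:
# remember every line of t1, then print the lines of t2 that were also in t1
awk 'NR == FNR { seen[$0] = 1 ; next } $0 in seen' tags/t1 tags/t2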
回答by jkshah
You could try with the comm utility
您可以尝试使用 comm 实用程序
comm -12 <t1> <t2>
comm with an appropriate combination of the following options can be useful for different set operations on file contents.
comm 配合适当的选项组合,可用于对文件内容进行不同的集合运算。
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
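As a hedged aside, the same columns give the other set operations on sorted files:
comm -23 t1 t2   # set difference: lines in t1 but not in t2
comm -13 t1 t2   # set difference: lines in t2 but not in t1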
This assumes <t1> and <t2> are sorted. If not, they should first be sorted with sort.
这假设 <t1> 和 <t2> 已经排好序。如果没有,应先用 sort 对它们进行排序。
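If the files are not pre-sorted, a hedged one-liner using bash process substitution avoids creating temporary sorted copies:
# sort on the fly, then keep only the lines common to both tag files
comm -12 <(sort t1) <(sort t2)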
回答by Adam Liss
Can you use
你能用吗
sort t1 t2 | uniq -d
This will combine the two files, sort them, and then display only the lines that appear more than once: that is, the ones that appear in both files.
这将合并两个文件,对它们进行排序,然后仅显示出现多次的行:即出现在两个文件中的行。
This assumes that each file contains no duplicates within it, and that the inodes are the same in all the structures for a particular file.
这假设每个文件中不包含重复项,并且特定文件的所有结构中的 inode 都相同。
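A hedged extension of the same idea to more than two tag files: de-duplicate each file first, then keep the lines whose count equals the number of files (3 in this sketch):
# sort -u removes within-file duplicates; uniq -c then counts in how many files each line appears
{ sort -u t1 ; sort -u t2 ; sort -u t3 ; } | sort | uniq -c | sed -n 's/^ *3 //p'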
回答by Krister Janmore
Using awk, it's quite easy to create a single-command solution that works for an arbitrary number of unsorted files. For large files, it can be much quicker than using sort and pipes, as I show below. By changing $0 to $1 etc., you can also find the intersection of specific columns.
使用 awk,可以很容易地写出一个适用于任意数量未排序文件的单命令解决方案。对于大文件,它可能比使用 sort 和管道快得多,如下所示。通过把 $0 改成 $1 等,还可以求特定列的交集。
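For instance, to intersect on only the second column (the filename) instead of whole lines, a hedged adaptation of the counting idea used in Solution #1 below, assuming a filename appears at most once per tag file:
# b counts the input files; a[$2] counts in how many files each filename occurs
awk ' FNR == 1 { b++ } { a[$2]++ } END { for (i in a) { if (a[i] == b) { print i } } } ' t1 t2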
I've included 3 solutions: a simple one that does not handle duplicated lines within files; a more complicated one that does handle them; and an even more complicated one that also handles them and is (over-)engineered for performance. Solutions #1 and #2 assume a version of awk that has the FNR variable, and solution #3 requires gawk's ENDFILE (although this can be circumvented by using FNR == 1 instead and rearranging some logic).
我提供了 3 个解决方案:一个简单的,不处理文件内的重复行;一个更复杂的,可以处理它们;还有一个更复杂的,同样可以处理它们,并且为了性能做了(过度)设计。解决方案 #1 和 #2 假设所用的 awk 版本具有 FNR 变量,解决方案 #3 则需要 gawk 的 ENDFILE(不过可以改用 FNR == 1 并重新调整一些逻辑来规避这一点)。
Solution #1 (does not handle duplicated lines within files):
解决方案#1(不处理文件中的重复行):
awk ' FNR == 1 { b++ } { a[$0]++ } END { for (i in a) { if (a[i] == b) { print i } } } ' \
  t1 t2 t3
Solution #2 (handles duplicated lines within files):
解决方案#2(处理文件中的重复行):
awk ' FNR == 1 { b++ ; delete c }
      c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
  t1 t2 t3
Solution #3 (performant, handles duplicates within files, but complex, and as written relies on gawk's ENDFILE):
解决方案 #3(性能好,也能处理文件内的重复行,但更复杂,并且按目前的写法依赖 gawk 的 ENDFILE):
awk ' b == 0 { a[$0] = 0 ; next }
      $0 in a { a[$0] = 1 }
      ENDFILE {
        if (b == 0) { b = 1 }
        else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
      }
      END { for (i in a) { print i } } ' \
  t1 t2 t3
Explanation for #1:
#1 的解释:
FNR == 1 { b++ }         # when awk reads the first line of a new file, FNR resets
                         # to 1. every time FNR == 1, we increment a counter
                         # variable b.
                         # this counts the number of input files.

{ a[$0]++ }              # on every line in every file, take the whole line ( $0 ),
                         # use it as a key in the array a, and increase the value
                         # of a[$0] by 1.
                         # this counts the number of observations of line $0 across
                         # all input files.

END {                    # after reading the last line of the last file...
  for (i in a) {         # ... loop over the keys of array a ...
    if (a[i] == b) {     # ... and if the value at that key is equal to the number
                         #     of input files...
      print i            # ... we print the key - i.e. the line.
    }
  }
}
Explanation for #2:
#2 的解释:
FNR == 1 { b++ ; delete c }          # as previous solution, but now we also clear the
                                     # array c between files.

c[$0] == 0 { a[$0]++ ; c[$0] = 1 }   # as above, but now we include an array c that
                                     # indicates if we've seen lines *within* each file.
                                     # if we haven't seen the line before in this file, we
                                     # increment the count at that line(/key) in array a.
                                     # we also set the value at that key in array c to 1
                                     # to note that we've now seen it in this file before.
Explanation for #3:
#3 的解释:
This post is already quite long so I won't do a line-by-line for this solution. But in short: 1) we create an array a that includes every line in the first file as a key, with all values set to 0; 2) on subsequent files, if that line is a key in a, we set the value at that key to 1; 3) at the end of each file, we delete all keys in a that have value 0 (indicating we didn't see it in the previous file), and reset all remaining values to 0; 4) after all files have been read, print every key that's left in a. We get a good speed-up here because instead of having to keep and search through an array of every single line we've seen so far, we're only keeping an array of lines that are the intersection of all previous files, which (usually!) shrinks with each new file.
这篇文章已经很长了,所以我不会对这个方案逐行解释。简而言之:1) 我们创建一个数组 a,把第一个文件中的每一行作为键,所有值都设为 0;2) 在读后续文件时,如果该行是 a 中的一个键,就把该键的值设为 1;3) 在每个文件结束时,删除 a 中所有值为 0 的键(表明在前一个文件中没有看到它),并把其余值重置为 0;4) 读完所有文件后,打印 a 中剩下的每个键。我们在这里获得了不错的加速,因为不必保存并搜索由迄今见过的每一行组成的数组,而只需保存由所有先前文件交集中的行组成的数组,它(通常!)会随着每个新文件而缩小。
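As a hedged sketch of the circumvention mentioned earlier, the ENDFILE block can be replaced by an FNR == 1 rule plus a final check in END, so the same approach runs on awks without gawk's ENDFILE (assumes non-empty input files):
awk ' FNR == 1 && NR > 1 {                # start of every file after the first
        if (b == 0) { b = 1 }             # the first file just ended: nothing to prune yet
        else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
      }
      b == 0 { a[$0] = 0 ; next }         # first file: collect every line
      $0 in a { a[$0] = 1 }               # later files: mark lines seen again
      END { for (i in a) { if (a[i] == 1 || b == 0) { print i } } } ' \
  t1 t2 t3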
Benchmarking:
基准测试:
Note: the improvement in runtime appears to get more significant as the lines within files get longer.
注意:文件中的行越长,运行时间的改进似乎越显著。
### Create test data with *no duplicated lines within files*
mkdir test_dir; cd test_dir
for i in {1..30}; do shuf -i 1-540000 -n 500000 > test_no_dups${i}.txt; done

### Solution #0: based on sort and uniq
time sort test_no_dups*.txt | uniq -c | sed -n 's/^ *30 //p' > intersect_no_dups.txt
# real    0m12.982s
# user    0m51.594s
# sys     0m3.250s
wc -l < intersect_no_dups.txt # 53772

### Solution #1:
time \
awk ' FNR == 1 { b++ }
      { a[$0]++ }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
  test_no_dups*.txt \
  > intersect_no_dups.txt
# real    0m8.048s
# user    0m7.484s
# sys     0m0.313s
wc -l < intersect_no_dups.txt # 53772

### Solution #2:
time \
awk ' FNR == 1 { b++ ; delete c }
      c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
  test_no_dups*.txt \
  > intersect_no_dups.txt
# real    0m14.965s
# user    0m14.688s
# sys     0m0.297s
wc -l < intersect_no_dups.txt # 53772

### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
      $0 in a { a[$0] = 1 }
      ENDFILE {
        if (b == 0) { b = 1 }
        else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
      }
      END { for (i in a) { print i } } ' \
  test_no_dups*.txt \
  > intersect_no_dups.txt
# real    0m5.929s
# user    0m5.672s
# sys     0m0.250s
wc -l < intersect_no_dups.txt # 53772
And if files can contain duplicates:
如果文件可以包含重复项:
### Create test data containing repeated lines (-r: sample w/ replacement)
for i in {1..30} ; do
  shuf -r -i 1-150000 -n 500000 > test_dups${i}.txt
done

### Solution #0: based on sort and uniq
time \
for i in test_dups*.txt ; do
  sort -u "$i"
done \
| sort \
| uniq -c \
| sed -n 's/^ *30 //p' \
> intersect_dups.txt
# real    0m13.503s
# user    0m26.688s
# sys     0m2.297s
wc -l < intersect_dups.txt # 50389

### [Solution #1 won't work here]

### Solution #2:
# note: `delete c` can be replaced with `split("", c)`
time \
awk ' FNR == 1 { b++ ; delete c }
      c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
  test_dups*.txt \
  > intersect_dups.txt
# real    0m7.097s
# user    0m6.891s
# sys     0m0.188s
wc -l < intersect_dups.txt # 50389

### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
      $0 in a { a[$0] = 1 }
      ENDFILE {
        if (b == 0) { b = 1 }
        else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
      }
      END { for (i in a) { print i } } ' \
  test_dups*.txt \
  > intersect_dups.txt
# real    0m4.616s
# user    0m4.375s
# sys     0m0.234s
wc -l < intersect_dups.txt # 50389
回答by bsb
Version for multiple files:
多个文件的版本:
##代码##Expands to:
扩展为:
##代码##Test files:
测试文件:
##代码##Output:
输出:
0
6
12
18
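A hedged loop-based equivalent of the generated pipeline that avoids perl; the intersect function name and the use of here-strings are illustrative choices only:
# keep narrowing the candidate set: start with the first file, then filter through each remaining file
intersect() {
  local acc
  acc=$(cat "$1"); shift
  for f in "$@"; do
    acc=$(grep -xF -f- "$f" <<< "$acc")   # current candidates act as fixed whole-line patterns
  done
  printf '%s\n' "$acc"
}
intersect t1 t2 t3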