Bash:从多个文件中获取交集
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/19214179/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me):
StackOverFlow
Bash. Get intersection from multiple files
提问by Jonovono
So let me explain this a bit more:
所以让我再解释一下:
I have a directory called tags that has a file for each tag, something like:
我有一个名为 tags 的目录,其中每个标签都有一个文件,例如:
tags/
t1
t2
t3
In each of the tag files is a structure like:
在每个标签文件中都有一个结构,如:
<inode> <filename> <filepath>
Of course, each tag file will have a list of many files with that tag (but a file can appear only once within a given tag file). And a file may be in multiple tag files.
当然,每个标签文件都会列出许多带有该标签的文件(但一个文件在同一个标签文件中只能出现一次),并且一个文件可能出现在多个标签文件中。
What I want to be able to do is call a command like
我想要做的是调用一个命令
tags <t1> <t2>
and have it list the files that have BOTH the tags t1 and t2 in a nice way.
并让它以一种很好的方式列出同时具有标签 t1 和 t2 的文件。
My plan right now was to make a temp file. Basically output the entire file of t1 into it. Then run through each line in t2 and do an awk on the file. And just keep doing that.
我现在的计划是制作一个临时文件。基本上将t1的整个文件输出到其中。然后遍历 t2 中的每一行并对文件执行 awk。继续这样做。
But I am wondering if anyone has any other ways. I am not overly familiar with awk, grep etc.
但我想知道是否有人有其他方法。我对 awk、grep 等不太熟悉。
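The temp-file plan described above can also be collapsed into a single awk call using the classic two-file idiom; a minimal sketch for the two-tag case, comparing whole lines (inode, filename and path) and assuming the tags/ paths shown above:
# remember every line of t1, then print the lines of t2 that were also in t1
awk 'NR == FNR { seen[$0] = 1 ; next } $0 in seen' tags/t1 tags/t2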
回答by jkshah
You could try with the comm utility
您可以尝试使用 comm 实用程序
comm -12 <t1> <t2>
comm with an appropriate combination of the following options can be useful for different set operations on file contents.
comm 配合适当的选项组合,可用于对文件内容进行不同的集合运算。
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
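As a hedged aside, the same columns give the other set operations on sorted files:
comm -23 t1 t2   # set difference: lines in t1 but not in t2
comm -13 t1 t2   # set difference: lines in t2 but not in t1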
This assumes <t1> and <t2> are sorted. If not, they should first be sorted with sort.
这假设 <t1> 和 <t2> 已经排好序。如果没有,应先用 sort 对它们进行排序。
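If the files are not pre-sorted, a hedged one-liner using bash process substitution avoids creating temporary sorted copies:
# sort on the fly, then keep only the lines common to both tag files
comm -12 <(sort t1) <(sort t2)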
回答by Adam Liss
Can you use
你能用吗
sort t1 t2 | uniq -d
This will combine the two files, sort them, and then display only the lines that appear more than once: that is, the ones that appear in both files.
这将合并两个文件,对它们进行排序,然后仅显示出现多次的行:即出现在两个文件中的行。
This assumes that each file contains no duplicates within it, and that the inodes are the same in all the structures for a particular file.
这假设每个文件中不包含重复项,并且特定文件的所有结构中的 inode 都相同。
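A hedged extension of the same idea to more than two tag files: de-duplicate each file first, then keep the lines whose count equals the number of files (3 in this sketch):
# sort -u removes within-file duplicates; uniq -c then counts in how many files each line appears
{ sort -u t1 ; sort -u t2 ; sort -u t3 ; } | sort | uniq -c | sed -n 's/^ *3 //p'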
回答by Krister Janmore
Using awk, it's quite easy to create a single-command solution that works for an arbitrary number of unsorted files. For large files, it can be much quicker than using sort and pipes, as I show below. By changing $0 to $1 etc., you can also find the intersection of specific columns.
使用 awk,可以很容易地写出一个适用于任意数量未排序文件的单命令解决方案。对于大文件,它可能比使用 sort 和管道快得多,如下所示。通过把 $0 改成 $1 等,还可以求特定列的交集。
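For instance, to intersect on only the second column (the filename) instead of whole lines, a hedged adaptation of the counting idea used in Solution #1 below, assuming a filename appears at most once per tag file:
# b counts the input files; a[$2] counts in how many files each filename occurs
awk ' FNR == 1 { b++ } { a[$2]++ } END { for (i in a) { if (a[i] == b) { print i } } } ' t1 t2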
I've included 3 solutions: a simple one that does not handle duplicated lines within files; a more complicated one that does handle them; and an even more complicated one that also handles them and is (over-)engineered for performance. Solutions #1 and #2 assume a version of awk that has the FNR variable, and solution #3 requires gawk's ENDFILE (although this can be circumvented by using FNR == 1 instead and rearranging some logic).
我提供了 3 个解决方案:一个简单的,不处理文件内的重复行;一个更复杂的,可以处理它们;还有一个更复杂的,同样可以处理它们,并且为了性能做了(过度)设计。解决方案 #1 和 #2 假设所用的 awk 版本具有 FNR 变量,解决方案 #3 则需要 gawk 的 ENDFILE(不过可以改用 FNR == 1 并重新调整一些逻辑来规避这一点)。
Solution #1 (does not handle duplicated lines within files):
解决方案#1(不处理文件中的重复行):
awk ' FNR == 1 { b++ } { a[$0]++ } END { for (i in a) { if (a[i] == b) { print i } } } ' \
  t1 t2 t3
Solution #2 (handles duplicated lines within files):
解决方案#2(处理文件中的重复行):
awk ' FNR == 1 { b++ ; delete c }
      c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
  t1 t2 t3
Solution #3 (performant, handles duplicates within files, but complex, and as written relies on gawk's ENDFILE):
解决方案 #3(性能好,也能处理文件内的重复行,但更复杂,并且按目前的写法依赖 gawk 的 ENDFILE):
awk ' b == 0 { a[$0] = 0 ; next }
      $0 in a { a[$0] = 1 }
      ENDFILE {
        if (b == 0) { b = 1 }
        else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
      }
      END { for (i in a) { print i } } ' \
  t1 t2 t3
Explanation for #1:
#1 的解释:
FNR == 1 { b++ }         # when awk reads the first line of a new file, FNR resets
                         # to 1. every time FNR == 1, we increment a counter
                         # variable b.
                         # this counts the number of input files.

{ a[$0]++ }              # on every line in every file, take the whole line ( $0 ),
                         # use it as a key in the array a, and increase the value
                         # of a[$0] by 1.
                         # this counts the number of observations of line $0 across
                         # all input files.

END {                    # after reading the last line of the last file...
  for (i in a) {         # ... loop over the keys of array a ...
    if (a[i] == b) {     # ... and if the value at that key is equal to the number
                         #     of input files...
      print i            # ... we print the key - i.e. the line.
    }
  }
}
Explanation for #2:
#2 的解释:
FNR == 1 { b++ ; delete c }          # as previous solution, but now we also clear the
                                     # array c between files.

c[$0] == 0 { a[$0]++ ; c[$0] = 1 }   # as above, but now we include an array c that
                                     # indicates if we've seen lines *within* each file.
                                     # if we haven't seen the line before in this file, we
                                     # increment the count at that line(/key) in array a.
                                     # we also set the value at that key in array c to 1
                                     # to note that we've now seen it in this file before.
Explanation for #3:
#3 的解释:
This post is already quite long so I won't do a line-by-line for this solution. But in short: 1) we create an array a that includes every line in the first file as a key, with all values set to 0; 2) on subsequent files, if that line is a key in a, we set the value at that key to 1; 3) at the end of each file, we delete all keys in a that have value 0 (indicating we didn't see it in the previous file), and reset all remaining values to 0; 4) after all files have been read, print every key that's left in a. We get a good speed-up here because instead of having to keep and search through an array of every single line we've seen so far, we're only keeping an array of lines that are the intersection of all previous files, which (usually!) shrinks with each new file.
这篇文章已经很长了,所以我不会对这个方案逐行解释。简而言之:1) 我们创建一个数组 a,把第一个文件中的每一行作为键,所有值都设为 0;2) 在读后续文件时,如果该行是 a 中的一个键,就把该键的值设为 1;3) 在每个文件结束时,删除 a 中所有值为 0 的键(表明在前一个文件中没有看到它),并把其余值重置为 0;4) 读完所有文件后,打印 a 中剩下的每个键。我们在这里获得了不错的加速,因为不必保存并搜索由迄今见过的每一行组成的数组,而只需保存由所有先前文件交集中的行组成的数组,它(通常!)会随着每个新文件而缩小。
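As a hedged sketch of the circumvention mentioned earlier, the ENDFILE block can be replaced by an FNR == 1 rule plus a final check in END, so the same approach runs on awks without gawk's ENDFILE (assumes non-empty input files):
awk ' FNR == 1 && NR > 1 {                # start of every file after the first
        if (b == 0) { b = 1 }             # the first file just ended: nothing to prune yet
        else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
      }
      b == 0 { a[$0] = 0 ; next }         # first file: collect every line
      $0 in a { a[$0] = 1 }               # later files: mark lines seen again
      END { for (i in a) { if (a[i] == 1 || b == 0) { print i } } } ' \
  t1 t2 t3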
Benchmarking:
基准测试:
Note: the improvement in runtime appears to get more significant as the lines within files get longer.
注意:文件中的行越长,运行时间的改进似乎越显著。
### Create test data with *no duplicated lines within files*
mkdir test_dir; cd test_dir
for i in {1..30}; do shuf -i 1-540000 -n 500000 > test_no_dups${i}.txt; done

### Solution #0: based on sort and uniq
time sort test_no_dups*.txt | uniq -c | sed -n 's/^ *30 //p' > intersect_no_dups.txt
# real    0m12.982s
# user    0m51.594s
# sys     0m3.250s
wc -l < intersect_no_dups.txt # 53772

### Solution #1:
time \
awk ' FNR == 1 { b++ }
      { a[$0]++ }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
  test_no_dups*.txt \
  > intersect_no_dups.txt
# real    0m8.048s
# user    0m7.484s
# sys     0m0.313s
wc -l < intersect_no_dups.txt # 53772

### Solution #2:
time \
awk ' FNR == 1 { b++ ; delete c }
      c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
  test_no_dups*.txt \
  > intersect_no_dups.txt
# real    0m14.965s
# user    0m14.688s
# sys     0m0.297s
wc -l < intersect_no_dups.txt # 53772

### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
      $0 in a { a[$0] = 1 }
      ENDFILE {
        if (b == 0) { b = 1 }
        else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
      }
      END { for (i in a) { print i } } ' \
  test_no_dups*.txt \
  > intersect_no_dups.txt
# real    0m5.929s
# user    0m5.672s
# sys     0m0.250s
wc -l < intersect_no_dups.txt # 53772
And if files can contain duplicates:
如果文件可以包含重复项:
### Create test data containing repeated lines (-r: sample w/ replacement)
for i in {1..30} ; do
  shuf -r -i 1-150000 -n 500000 > test_dups${i}.txt
done

### Solution #0: based on sort and uniq
time \
for i in test_dups*.txt ; do
  sort -u "$i"
done \
| sort \
| uniq -c \
| sed -n 's/^ *30 //p' \
> intersect_dups.txt
# real    0m13.503s
# user    0m26.688s
# sys     0m2.297s
wc -l < intersect_dups.txt # 50389

### [Solution #1 won't work here]

### Solution #2:
# note: `delete c` can be replaced with `split("", c)`
time \
awk ' FNR == 1 { b++ ; delete c }
      c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
  test_dups*.txt \
  > intersect_dups.txt
# real    0m7.097s
# user    0m6.891s
# sys     0m0.188s
wc -l < intersect_dups.txt # 50389

### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
      $0 in a { a[$0] = 1 }
      ENDFILE {
        if (b == 0) { b = 1 }
        else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
      }
      END { for (i in a) { print i } } ' \
  test_dups*.txt \
  > intersect_dups.txt
# real    0m4.616s
# user    0m4.375s
# sys     0m0.234s
wc -l < intersect_dups.txt # 50389
回答by bsb
Version for multiple files:
多个文件的版本:
##代码##Expands to:
扩展为:
##代码##Test files:
测试文件:
##代码##Output:
输出:
0
6
12
18
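A hedged loop-based equivalent of the generated pipeline that avoids perl; the intersect function name and the use of here-strings are illustrative choices only:
# keep narrowing the candidate set: start with the first file, then filter through each remaining file
intersect() {
  local acc
  acc=$(cat "$1"); shift
  for f in "$@"; do
    acc=$(grep -xF -f- "$f" <<< "$acc")   # current candidates act as fixed whole-line patterns
  done
  printf '%s\n' "$acc"
}
intersect t1 t2 t3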