Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18204904/

Time: 2020-09-10 00:03:32  Source: igfitidea

Fast way of finding lines in one file that are not in another?

Tags: bash, grep, find, diff

Asked by Niels2000

I have two large files (sets of filenames). Roughly 30.000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.

For example, if this is file1:

line1
line2
line3

And this is file2:

line1
line4
line5

Then my result/output should be:

line2
line3

This works:

grep -v -f file2 file1

But it is very, very slow when used on my large files.

I suspect there is a good way to do this using diff, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.

Can anyone help me find a fast way of doing this, using bash and basic linux binaries?

EDIT: To follow up on my own question, this is the best way I have found so far using diff:

diff file2 file1 | grep '^>' | sed 's/^>\ //'

Surely, there must be a better way?

Accepted answer by mr.spuratic

You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:

diff --new-line-format="" --unchanged-line-format=""  file1 file2

The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ):

diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)

In the above, new and unchanged lines are suppressed, so only changed lines (i.e. removed lines in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w etc.) for less strict matching.
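As a rough illustration of the less strict matching mentioned above, the following sketch runs the same command with and without -i. The file contents and the temporary directory are made up for this example, and GNU diff is assumed; temporary files are used instead of process substitution so the snippet also runs under plain sh.

```shell
#!/bin/sh
# Sample data (made up): the first lines differ only in letter case.
tmpdir=$(mktemp -d)
printf 'Line1\nline2\n' > "$tmpdir/file1"
printf 'line1\nline4\n' > "$tmpdir/file2"

# diff exits with status 1 when the files differ, so that status is ignored.
strict=$(diff --new-line-format='' --unchanged-line-format='' \
    "$tmpdir/file1" "$tmpdir/file2" || true)

# -i makes diff treat "Line1" and "line1" as the same line.
relaxed=$(diff -i --new-line-format='' --unchanged-line-format='' \
    "$tmpdir/file1" "$tmpdir/file2" || true)

printf 'case-sensitive:\n%s\n' "$strict"
printf 'with -i:\n%s\n' "$relaxed"
rm -rf "$tmpdir"
```

In the case-sensitive run both lines of file1 are reported; with -i only line2 should remain.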



Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it with:

diff --old-line-format="-%L" --unchanged-line-format=" %L" \
     --new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u (note that it only outputs differences; it lacks the ---, +++ and @@ lines at the top of each grouped change). You can also use this to do other useful things like number each line with %dn.
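As a small illustration of the %dn specifier, the sketch below prints each line that appears only in file1, prefixed with its line number in file1. The sample contents are the ones from the question; the helper name number_missing_lines and the temporary directory are made up for this example, and GNU diff is assumed.

```shell
#!/bin/sh
# Hypothetical helper: list lines found only in the first file, each
# prefixed with its line number in that file. Requires GNU diff.
number_missing_lines() {
    # diff exits with status 1 when the files differ, so that status is ignored.
    diff --old-line-format='%dn: %L' \
         --new-line-format='' \
         --unchanged-line-format='' "$1" "$2" || true
}

tmpdir=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$tmpdir/file1"
printf 'line1\nline4\nline5\n' > "$tmpdir/file2"

numbered=$(number_missing_lines "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$numbered"
rm -rf "$tmpdir"
```

With the question's sample data this should print "2: line2" and "3: line3".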



The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort in place. Here's a simple awk (nawk) script (inspired by the scripts linked to in konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.

# output lines in file1 that are not in file2
BEGIN { FS="" }                         # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }                # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}

This stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line in file1 is present in file2. (This will have different output to the diff method if there are duplicates.)
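The remark about duplicates can be made concrete with a small sketch. The sample data below is made up (file1 contains the line "a" twice, file2 once), GNU diff is assumed, and the awk program is an inline variant of the indexing approach described above rather than the exact script.

```shell
#!/bin/sh
# Sample data (made up): file1 has a duplicate line; both files are sorted.
tmpdir=$(mktemp -d)
printf 'a\na\nb\n' > "$tmpdir/file1"
printf 'a\n' > "$tmpdir/file2"

# diff method: each line of file2 pairs with at most one line of file1.
# diff exits with status 1 when the files differ, so that status is ignored.
diff_result=$(diff --new-line-format='' --unchanged-line-format='' \
    "$tmpdir/file1" "$tmpdir/file2" || true)

# awk method: drops every line of file1 whose content occurs anywhere in file2.
awk_result=$(awk '
    NR == FNR { ll1[FNR] = $0; nl1 = FNR; next }   # first file: index by line number
    { ss2[$0]++ }                                  # second file: index by content
    END { for (ll = 1; ll <= nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll] }
' "$tmpdir/file1" "$tmpdir/file2")

printf 'diff method:\n%s\n' "$diff_result"
printf 'awk method:\n%s\n' "$awk_result"
rm -rf "$tmpdir"
```

Here the diff method should still report one surplus "a" along with "b", while the awk method reports only "b".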

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.

BEGIN { FS="" }
(NR==FNR) {  # file1, index by lineno and string
  ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) {  # file2
  if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
  for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}

The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.

In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), with repeated runs over chunks of file1, reading file2 completely each time:

split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1

Note the use and placement of - meaning stdin on the gawk command line. This is provided by split from file1 in chunks of 20000 lines per invocation.

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provides GNU diff and awk, though only a POSIX/BSD split rather than a GNU version.

Answer by JnBrymn

The comm command (short for "common") may be useful: comm - compare two sorted files line by line

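#find lines only in file1
comm -23 file1 file2

#find lines only in file2
comm -13 file1 file2

#find lines common to both files
comm -12 file1 file2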

The man file is actually quite readable for this.
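Since comm expects sorted input, unsorted files need a sort step first. The following is a minimal sketch of that, using temporary files so it also works without process substitution; the helper name lines_only_in_first and the sample data are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print lines of the first file that are absent from
# the second, sorting both inputs first because comm expects sorted files.
lines_only_in_first() {
    workdir=$(mktemp -d)
    sort "$1" > "$workdir/a.sorted"
    sort "$2" > "$workdir/b.sorted"
    # -2 suppresses lines unique to the second file, -3 suppresses common lines.
    comm -23 "$workdir/a.sorted" "$workdir/b.sorted"
    rm -rf "$workdir"
}

tmpdir=$(mktemp -d)
printf 'line3\nline1\nline2\n' > "$tmpdir/file1"   # deliberately unsorted
printf 'line5\nline1\nline4\n' > "$tmpdir/file2"

only_in_file1=$(lines_only_in_first "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$only_in_file1"
rm -rf "$tmpdir"
```

Note that the result comes out in sorted order, not in the original order of file1.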

Answer by pbz

As konsolebox suggested, the poster's grep solution

grep -v -f file2 file1

actually works great (fast) if you simply add the -F option, to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000 line file lists I had to compare. With -F it took 0.031 s (real), while without it took 2.278 s (real), when redirecting grep output to wc -l.

These tests also included the -x switch, which is a necessary part of the solution in order to ensure total accuracy in cases where file2 contains lines which match part of, but not all of, one or more lines in file1.

So a solution that does not require the inputs to be sorted, and is fast and flexible (case sensitivity, etc.), is:

grep -F -x -v -f file2 file1

This doesn't work with all versions of grep. For example, it fails on macOS, where a line in file1 is reported as not present in file2, even though it is, if it also matches another line that is a substring of it. Alternatively, you can install GNU grep on macOS in order to use this solution.
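To see why -x matters, here is a small sketch comparing the two invocations on made-up data in which a line of file2 ("foo") is a substring of a different line of file1 ("foobar"). GNU grep is assumed, given the macOS caveat above.

```shell
#!/bin/sh
# Sample data (made up): "foo" in file2 is a substring of "foobar" in file1.
tmpdir=$(mktemp -d)
printf 'foo\nfoobar\nbaz\n' > "$tmpdir/file1"
printf 'foo\n' > "$tmpdir/file2"

# Without -x, any line of file1 containing "foo" counts as a match and is dropped.
without_x=$(grep -F -v -f "$tmpdir/file2" "$tmpdir/file1")

# With -x, only lines equal to a whole pattern are dropped.
with_x=$(grep -F -x -v -f "$tmpdir/file2" "$tmpdir/file1")

printf 'without -x:\n%s\n' "$without_x"
printf 'with -x:\n%s\n' "$with_x"
rm -rf "$tmpdir"
```

Without -x, "foobar" is wrongly dropped and only "baz" remains; with -x both "foobar" and "baz" are kept.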

Answer by Puggan Se

What's the speed of sort and diff?

sort file1 -u > file1.sorted
sort file2 -u > file2.sorted
diff file1.sorted file2.sorted

Answer by Ondra Žižka

If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq:

cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

Test:

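seq 1 1 7 | sort --random-sort > includes.txt
seq 3 1 9 | sort --random-sort > excludes.txt
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique

# Output:
1
2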

This is also relatively fast, compared to grep.
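One caveat worth checking before relying on this trick: it keeps only lines that occur exactly once in the combined stream, so a line repeated inside includes.txt is dropped as well, even if it never appears in excludes.txt. The sketch below shows this on made-up data; it uses uniq -u, the short POSIX spelling of --unique.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
printf 'a\nb\nb\nc\n' > "$tmpdir/includes.txt"   # "b" is listed twice
printf 'c\nd\n' > "$tmpdir/excludes.txt"

# Listing excludes.txt twice guarantees every excluded line occurs at least
# twice, so uniq -u never prints it.
remaining=$(cat "$tmpdir/includes.txt" "$tmpdir/excludes.txt" "$tmpdir/excludes.txt" \
    | sort | uniq -u)

printf '%s\n' "$remaining"
rm -rf "$tmpdir"
```

Only "a" survives here: "c" is excluded as intended, but "b" is lost too because it was duplicated in includes.txt. If that matters, deduplicating includes.txt first (for example with sort -u) is one possible workaround.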

Answer by Steven Penny

$ join -v 1 -t '' file1 file2
line2
line3

The -t makes sure that it compares the whole line, in case some of the lines contain a space.

Answer by GypsyCosmonaut

Use combine from the moreutils package, a sets utility that supports not, and, or, xor operations

combine file1 not file2

i.e. give me lines that are in file1 but not in file2

OR give me lines in file1 minus lines in file2

Note: combine sorts and finds unique lines in both files before performing any operation, but diff does not. So you might find differences between the output of diff and combine.

So in effect you are saying

Find distinct lines in file1 and file2 and then give me lines in file1 minus lines in file2

In my experience, it's much faster than other options

Answer by HelloGoodbye

You can use Python:

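python -c '
lines_to_remove = set()
with open("file2", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())

with open("file1", "r") as f:
    for line in f.readlines():
        if line.strip() not in lines_to_remove:
            print(line.strip())
'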

Answer by konsolebox

Using fgrep or adding the -F option to grep could help. But for faster calculations you could use Awk.

You could try one of these Awk methods:

http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/#post4066219

Answer by BAustin

The way I usually do this is using the --suppress-common-lines flag, though note that this only works if you do it in side-by-side format.

diff -y --suppress-common-lines file1.txt file2.txt
