bash: Comparing two large files

Note: this content is translated from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/10801760/

Comparing two large files

Tags: python, algorithm, bash, grep, large-files

Asked by user1155413

I need to write a program that will write to a file the difference between two files. The program has to loop through a 600 MB file with over 13,464,448 lines, check whether a grep returns true on another file, and then write the result to another file. I wrote a quick test with about 1,000,000 records and it took over an hour, so I'm guessing this approach could take 9+ hours.

Do you have any recommendations on how to make this faster? Is there any particular language I should use? I was planning on doing it in bash or Python.

Thanks a lot in advance.

[EDIT 1]: Sorry, when I say the difference between two files I do not mean a diff. The result file is in a different format.

The logic is a bit like this:

File A has 297,599 lines. File B has over 13 million lines.

I pick the current line being read from File A, grep for it in File B, and if the line is present in File B I write it to the result file. By the way, File A and File B have different formats. The result file will have the format of File A.

[EDIT 2]: I was asked at work to create a bash solution, ideally, so that we don't have to install Python on all the machines this has to run on.

This is my current implementation:

#!/bin/bash

LAST_TTP=$(ls -ltr TTP_*.txt | tail -1 | awk '{ print $NF }')  # newest TTP_*.txt (filename is the last field of ls -l output)
LAST_EXP=$(ls -ltr *.SSMT | tail -1 | awk '{ print $NF }')     # newest *.SSMT file

while read -r line; do
   MATCH="$(grep "$line" "$LAST_EXP")"
   echo "line: $line, match: $MATCH"

   # if not empty
   if [ ! -z "$MATCH" ]
   then
      echo "$MATCH" >> result
   fi

done < "$LAST_TTP"

This bash approach is taking over 10 hours to complete. Do you have any suggestions on how to make it more efficient in bash?

Thanks a lot in advance!

Answered by phihag

You're probably looking in a list instead of a set, leading to O(n^2) performance. Try:

with open('b') as b:
  blines = set(b)  # read all of B into a set: O(1) average-case membership tests
with open('a') as a:
  with open('result', 'w') as result:
    for line in a:
      if line not in blines:  # write out lines of A that do not appear in B
        result.write(line)

Assuming uniformly long (and not overly long) lines, the performance of this implementation is in O(|A| + |B|) (amortized, due to Python's set being extremely fast). The memory demand is in O(|B|), but with a factor significantly greater than 1.

If the order of lines in the output does not matter, you can also sort both files and then compare them line-by-line. This will have a performance on the order of O(|A| log |A| + |B| log |B|). The memory demand will be in O(|A| + |B|), or more precisely, |A| + |B|.

Answered by Mark Ransom

Sort each of the input files. Now read one line from each. If one line compares less than the other, output it as a difference and read the next line from that file. If both lines compare equal, read the next line from both files. If you read to the end of one file, all the lines in the other file are differences.

This is an O(n log n) algorithm as opposed to the O(n^2) algorithm you started with.

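Sketched in Python (a sketch only: write_differences and the paths are placeholder names, and both inputs are assumed to have been sorted up front, e.g. with the sort command), the algorithm looks like this:

def write_differences(path_a, path_b, out_path):
  # Merge-compare two pre-sorted files; unmatched lines from either side
  # are written out as differences.
  with open(path_a) as fa, open(path_b) as fb, open(out_path, 'w') as out:
    la, lb = fa.readline(), fb.readline()
    while la and lb:
      if la < lb:              # la has no counterpart in the other file
        out.write(la)
        la = fa.readline()
      elif lb < la:            # lb has no counterpart in the other file
        out.write(lb)
        lb = fb.readline()
      else:                    # equal lines: advance both files
        la, lb = fa.readline(), fb.readline()
    while la:                  # B exhausted: the rest of A are differences
      out.write(la)
      la = fa.readline()
    while lb:                  # A exhausted: the rest of B are differences
      out.write(lb)
      lb = fb.readline()

The comparison itself streams both files line by line, so once the inputs are sorted it needs only constant memory.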

Answered by Paused until further notice.

You can speed up your script slightly by stopping the grep when it finds the first match, if that's appropriate for your needs.

If your grep supports it, use grep -m 1.

Your problem is that you're spawning grep almost 300,000 times and it's reading over 13,000,000 lines each time.

Stopping grep at the first match will help a little, but the overhead of all those execs is a big factor as well. An implementation in Python will eliminate this issue.

As for choosing the files in your script, please see BashFAQ/003 and Parsing ls.

Answered by Chen Levy

Your implementation seems to do:

grep --fixed-strings --file=file_B file_A > result_file

But both @phihag's and @Mark Ransom's answers are better solutions.

Also, if you wish to use the heavy guns, your problem is a good candidate for map-reduce frameworks such as Hadoop.

Answered by Fabian Büttner

The join command also does what you want. join requires both files to be sorted up front. The -v option prints a line for each unpairable line.

So you will want something like

join -v 1 sortedfile1 sortedfile2

(You will need to set the join options according to your file format; see the join man page.)

The example below joins the files test1.txt and test2.txt, using the second field of test1.txt and the first field of test2.txt for the join. It assumes the files were sorted up front using the sort command. The -v 1 option outputs only the lines of test1.txt that could not be joined.

> cat test1.txt
a 1 2
b 2 3

> cat test2.txt
1 x
4 x

> join -v 1  -1 2  -2 1  test1.txt test2.txt
2 b 3

> join -v 1 -1 2 -2 1 -o 1.1 1.2 1.3 test1.txt test2.txt
b 2 3

sort and join both work fine with large files.

Answered by frankc

I would consider the comm command. It should be faster than grep because it will work with sorted data, while grep will always do a linear search.

comm -2 -3 <(sort file1) <(sort file2)

Answered by lind

You can also use awk:

awk 'NR==FNR { arrA[$0]; next } $0 in arrA' file_A file_B > result

The order of the files (... file_A file_B) on the command line is very important: NR==FNR is true only while awk is reading the first file, so the lines of file_A populate arrA, and each line of file_B is then printed only if it appears in arrA.
