Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/10831534/

How to compare two big files and get results to third file?

Tags: linux, bash, shell, unix

Asked by Martin Mocik

I have two files.

The first file looks like this:

www.example.com
www.domain.com
www.otherexample.com
www.other-domain.com
www.other-example.com
www.exa-ample.com

The second file looks like this (the numbers after ;;; are between 0 and 10):

www.example.com;;;2
www.domain.com;;;5
www.other-domain;;;0
www.exa-ample.com;;;4

I want to compare these two files and write the result to a third file, like this:

www.otherexample.com
www.other-example.com

Both files are large (over 500 MB).

Answer by Roman Newaza

You can use:

$ diff file1 file2 > file3

But it seems to me you want to disregard the trailing ;;;0 part, right? Then you need to process the second file line by line, stripping that last part, and finally compare the result with diff.
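A minimal sketch of that idea, assuming the file names file1, file2 and file3 from the command above plus a hypothetical intermediate file, file2.stripped. The sample data is a shortened version of the question's. Note that diff writes its own markers (such as < and >), so file3 ends up as a diff listing, not a plain list of missing lines.

```shell
# Work in a scratch directory so the sketch does not touch real files.
cd "$(mktemp -d)"

# Sample data shaped like the question (assumed, shortened).
printf '%s\n' www.example.com www.domain.com www.otherexample.com > file1
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' > file2

# Strip everything from the first ";;;" to the end of each line.
# file2.stripped is a hypothetical name for the intermediate file.
sed 's/;;;.*$//' file2 > file2.stripped

# diff exits with status 1 when the files differ, which is expected here.
diff file1 file2.stripped > file3 || true

cat file3
```

With this data, file3 should contain a single deleted line, marked as "< www.otherexample.com".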

Answer by Levon

You could use the diff command and direct the output to a third file, e.g.:

% diff data1.txt data2.txt > diffs

The diff man page shows a number of options that give you control over the comparison (processing and output).

The basic interactive operation without any options, assuming the data shown in your post is in the files data1.txt and data2.txt, yields:

% diff data1.txt data2.txt 

1,6d0
< www.example.com
< www.domain.com
< www.otherexample.com
< www.other-domain.com
< www.other-example.com
< www.exa-ample.com

Answer by Alessandro Pezzato

If a is the file with the first content and b is the file with the second content:

while IFS= read -r line; do grep -qF -- "$line" b || printf '%s\n' "$line"; done < a

It prints the lines of a that are not found in b. Note that grep rescans b once for every line of a, so this will be slow on files of this size.

Answer by camh

Use comm(1) to compare two sorted files and give the differences. Use grep(1) and sort(1) to get your files into an input format suitable for comparison with comm. Use process substitution in bash to tie it together:

comm -23 <(sort file1.txt) <(grep -o '^[^;]*' file2.txt | sort)

The -23 argument to comm says to ignore lines that are common to both files (-3) and lines unique to file 2 (-2). Depending on your exact specification, you can use -1, -2 or -3.

grep -o '^[^;]*' file2.txt just strips off everything from the first semicolon onward. You can use sed(1) for this, but if you are only extracting part of a line and not adding anything else, grep will often be faster.

comm needs the input files to be sorted, so sort is used to do that. The output will be sorted. sort uses locale-specific collation, so you may need to set LC_ALL=C depending on the exact collation you want.

Note that in your question you have www.other-domain in file 2, but www.other-domain.com in file 1. Given the expected output, I have assumed that this is a typo in file 2.

This runs all the processes in parallel and streams the file data through them, so even if the files are large, you do not need to create intermediate files yourself. (sort may still spill to its own temporary files when the input does not fit in its memory buffer.)
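As a hedged, end-to-end sketch of the pipeline above: the file names file1.txt, file2.txt and file3.txt are assumptions, the sample data is the question's with the presumed typo in file 2 corrected, and LC_ALL=C is exported so that sort and comm agree on ordering.

```shell
#!/usr/bin/env bash
# Requires bash for process substitution, <( ... ).
# Work in a scratch directory so the sketch does not touch real files.
cd "$(mktemp -d)"

# Sample data from the question (file 2 typo assumed fixed).
printf '%s\n' www.example.com www.domain.com www.otherexample.com \
  www.other-domain.com www.other-example.com www.exa-ample.com > file1.txt
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
  'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' > file2.txt

# Use one byte-wise collation for sort and comm alike.
export LC_ALL=C

# Keep only column 1 (lines unique to file1.txt):
# -2 suppresses lines unique to the second input, -3 suppresses common lines.
comm -23 <(sort file1.txt) <(grep -o '^[^;]*' file2.txt | sort) > file3.txt

cat file3.txt
```

With this data, file3.txt should contain www.other-example.com and www.otherexample.com, the two lines the question asks for, in sorted order.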

Answer by tripleee

If the input in file2 contains a subset of the contents of file1, you could just

sed 's/;.*//' file2 | grep -F -vxf - file1 >not-in-file2

The same general idea can be applied to diff or comm. However, comm requires sorted input, but if that is not a problem (or if your data is sorted to start with), just preprocess the data from file2.

sed 's/;.*//' file2.sorted | comm -13 - file1.sorted >cmp.out

The constraint that input needs to be sorted is what allows comm to handle really large files, because it just needs to keep the latest data in memory at any one time. You could do the same with your own custom awk script.
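To illustrate that last remark, here is a rough sketch of such an awk script (not taken from the original answer). It assumes both inputs are already sorted in the C locale, that file2.keys (a hypothetical name) holds file2 with the ;... suffix already stripped, and that the variable name other is free to use. It walks the two files in step, holding only one line from each in memory.

```shell
# Work in a scratch directory so the sketch does not touch real files.
cd "$(mktemp -d)"
export LC_ALL=C

# Made-up sample data, already sorted.
printf '%s\n' www.a.com www.b.com www.c.com www.d.com > file1.sorted
printf '%s\n' www.a.com www.c.com > file2.keys

awk -v other=file2.keys '
  # Read the next key from the second file; flag end of file.
  function advance() { if ((getline key < other) <= 0) eof = 1 }
  BEGIN { advance() }
  {
    # Skip keys that sort before the current line of file1.
    # Appending "" forces a string comparison.
    while (!eof && (key "") < ($0 "")) advance()
    # Print the line unless the second file has the same key.
    if (eof || (key "") != ($0 "")) print
  }
' file1.sorted > not-in-file2

cat not-in-file2
```

With this data, not-in-file2 should contain www.b.com and www.d.com. The sort order of the inputs has to match awk's string comparison, which is why the sketch pins LC_ALL=C.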