Linux: How to compare two big files and get the results in a third file?
Source: Stack Overflow, http://stackoverflow.com/questions/10831534/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Asked by Martin Mocik
I have two files
1st file is like this:
www.example.com
www.domain.com
www.otherexample.com
www.other-domain.com
www.other-example.com
www.exa-ample.com
2nd file is like this (numbers after ;;; are between 0-10):
www.example.com;;;2
www.domain.com;;;5
www.other-domain;;;0
www.exa-ample.com;;;4
I want to compare these two files and write the result to a third file, like this:
www.otherexample.com
www.other-example.com
Both files are large (over 500 MB).
Answered by Roman Newaza
You can use:
$ diff file1 file2 > file3
But it seems to me you want to disregard the trailing ;;;0 part, right? Then you need to process the second file line by line, stripping that last part, and finally compare the result with diff.
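A minimal sketch of that approach, using the sample data from the question. The file names file1, file2, file2.stripped and file3 are assumptions chosen for illustration. The sed expression removes the first semicolon and everything after it, so diff sees plain host names on both sides:

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)" || exit 1

# Sample data from the question.
printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com > file1
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain;;;0' 'www.exa-ample.com;;;4' > file2

# Strip the ";;;N" suffix from every line of file2.
sed 's/;.*//' file2 > file2.stripped

# diff exits with status 1 when the files differ, which is expected here.
diff file1 file2.stripped > file3 || true
cat file3
```

In the default diff output, lines prefixed with < are present only in file1. A further filter would be needed to turn them into the bare host names the question asks for.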
Answered by Levon
You could use the diff command and direct the output to a third file, e.g.:
% diff data1.txt data2.txt > diffs
The diff man page shows a number of options that give you control over the comparison (processing and output).
The basic interactive operation without specifying any options, assuming you have the data you show in your post in files data1.txt and data2.txt, yields:
% diff data1.txt data2.txt
1,6d0
< www.example.com
< www.domain.com
< www.otherexample.com
< www.other-domain.com
< www.other-example.com
< www.exa-ample.com
Answered by Alessandro Pezzato
If a is the file with the first content and b is the file with the second content:
while IFS= read -r line; do grep -qF -- "$line" b || printf '%s\n' "$line"; done < a
It prints what is not found in the second file.
Answered by camh
Use comm(1) to compare two sorted files and give the differences. Use grep(1) and sort(1) to get your files into an input format suitable for comparison with comm. Use process substitution in bash to tie it together:
comm -23 <(sort file1.txt) <(grep -o '^[^;]*' file2.txt | sort)
The -23 argument to comm says to ignore lines that are common to both files (-3) and lines unique to file 2 (-2). Depending on your exact specification, you can use -1, -2 or -3.
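To make the column flags concrete, here is a small sketch with two sorted files, left.txt and right.txt (hypothetical names chosen for illustration). comm puts lines found only in the first file in column 1, lines found only in the second file in column 2, and lines found in both in column 3. Each digit in the option suppresses that column:

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)" || exit 1

# Two small, already sorted inputs.
printf '%s\n' a b c > left.txt
printf '%s\n' b c d > right.txt

comm -23 left.txt right.txt   # only in left.txt:  a
comm -13 left.txt right.txt   # only in right.txt: d
comm -12 left.txt right.txt   # in both files:     b and c
```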
grep -o '^[^;]*' file2.txt prints only the part of each line before the first semicolon, dropping the semicolon and everything after it. You could also use sed(1) for this, but if you are only extracting part of a line and not adding anything else, grep will often be faster.
comm needs its input files to be sorted, so sort is used to do that. The output will be sorted as well. sort uses locale-specific collation, so you may need to set LC_ALL=C, depending on the exact collation you want.
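As a sketch of the locale point, the same comparison can be run with LC_ALL=C exported so that sort and comm agree on byte-wise ordering. The sample data is from the question, assuming the www.other-domain entry in file 2 was meant to be www.other-domain.com. file1.sorted and file3.txt are hypothetical names, and a sorted copy on disk is used here instead of process substitution so that the example also runs in a plain POSIX shell:

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)" || exit 1

printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com > file1.txt
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' > file2.txt

# Use the same byte-wise collation for every sort and for comm.
LC_ALL=C
export LC_ALL

sort file1.txt > file1.sorted
# "-" tells comm to read its second input from the pipe.
grep -o '^[^;]*' file2.txt | sort | comm -23 file1.sorted - > file3.txt
cat file3.txt
```

With the C locale, www.other-example.com sorts before www.otherexample.com because the hyphen has a lower byte value than the letter e.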
Note that in your question you have www.other-domain in file 2, but www.other-domain.com in file 1. Given the expected output, I have assumed that this is a typo in file 2.
This runs all the processes in parallel and streams the file data through them, so you do not need to create any intermediate files yourself, even if the files are large. Note that sort has to read all of its input before it can write anything, and for large inputs it may spill to temporary files of its own.
Answered by tripleee
If the entries in file2 are a subset of the contents of file1, you could just run:
sed 's/;.*//' file2 | grep -F -vxf - file1 >not-in-file2
The same general idea can be applied to diff or comm. comm requires sorted input, but if that is not a problem (or if your data is already sorted to start with), just preprocess the data from file2:
sed 's/;.*//' file2.sorted | comm -13 - file1.sorted >cmp.out
The constraint that the input needs to be sorted is what allows comm to handle really large files, because it only needs to keep the current line from each file in memory at any one time. You could do the same with your own custom awk script.
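Because comm relies on that ordering, it may be worth checking it before a long run. A minimal sketch, assuming hypothetical file names file1.sorted, file2.sorted and file2.keys, and using sort -c, which only checks the order and exits non-zero when the input is not sorted:

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)" || exit 1
LC_ALL=C
export LC_ALL

printf '%s\n' www.domain.com www.example.com www.otherexample.com > file1.sorted
printf '%s\n' 'www.domain.com;;;5' 'www.example.com;;;2' > file2.sorted

# Strip the suffix first, then verify that the result is still in order.
sed 's/;.*//' file2.sorted > file2.keys

if sort -c file1.sorted && sort -c file2.keys; then
    # -13 keeps only lines unique to the second input, file1.sorted.
    comm -13 file2.keys file1.sorted > cmp.out
else
    echo "inputs are not sorted; run sort first" >&2
fi
cat cmp.out
```

The check is done after stripping because removing the suffix can, in principle, change the relative order of two lines when one host name is a prefix of another.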