Java: how to compare large text files?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/7107517/

How to compare large text files?

Tags: java, file, comparison

Asked by Grrace

I have a general question and would like your opinion on my "technique".

There are 2 text files (file_1 and file_2) that need to be compared to each other. Both are very huge (3-4 gigabytes, from 30,000,000 to 45,000,000 lines each). My idea is to read several lines (as many as possible) of file_1 into memory, then compare those to all lines of file_2. If there's a match, the lines from both files that match shall be written to a new file. Then go on with the next 1000 lines of file_1 and also compare those to all lines of file_2, until I have gone through file_1 completely.

But this actually sounds really, really time-consuming and complicated to me. Can you think of any other method to compare those two files?

How long do you think the comparison could take? For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn't take more than a day though. ;-) But I am afraid my technique could take forever...

Another question that just came to my mind: how many lines would you read into memory? As many as possible? Is there a way to determine the number of possible lines before actually trying it? I want to read as many as possible (because I think that's faster) but I've run out of memory quite often.

Thanks in advance.

EDIT: I think I have to explain my problem a bit more.

The purpose is not to see if the two files in general are identical (they are not). There are some lines in each file that share the same "characteristic". Here's an example: file_1 looks somewhat like this:

mat1 1000 2000 TEXT      //this means the range is from 1000 - 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT

file_2 looks like this:

mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT

TEXT refers to characters and digits that are of no interest to me; mat can go from mat1 to mat50 and they are in no order; there can also be 1000 lines with mat2 (but the numbers in the next column are different). I need to find the matching lines in such a way that: matX is the same in both compared lines and the number mentioned in file_2 fits into the range mentioned in file_1. So in my example I would find one match: line 3 of file_1 and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010). I hope this makes it clear to you!

So my question is: how would you search for the matching lines?

Yes, I use Java as my programming language.

EDIT: I have now divided the huge files first, so that I have no problems with running out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning ;-) Nonetheless, all your approaches were very helpful to me, thank you for your replies!

Accepted answer by Alistair A. Israel

Now that you've given us more specifics, the approach I would take relies upon pre-partitioning, and optionally, sorting before searching for matches.

This should eliminate a substantial number of comparisons that wouldn't match anyway under the naive, brute-force approach. For the sake of argument, let's peg both files at 40 million lines each.

Partitioning: Read through file_1 and send all lines starting with mat1 to file_1_mat1, and so on. Do the same for file_2. This is trivial with a little grep, or, should you wish to do it programmatically in Java, it's a beginner's exercise.

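As a rough illustration, such a partitioning pass in Java might look like the sketch below; treating the first whitespace-separated token as the key, and all class and file names, are my assumptions for the example, not part of the answer.

import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class Partitioner {

    // Splits one input file into per-key files (e.g. file_1_mat1, file_1_mat2, ...)
    // keyed on the first whitespace-separated token of each line.
    public static void partition(String inputFile, String outputPrefix) throws IOException {
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.split("\\s+", 2)[0];              // e.g. "mat3"
                BufferedWriter writer = writers.get(key);
                if (writer == null) {                               // at most ~50 writers stay open
                    writer = new BufferedWriter(new FileWriter(outputPrefix + "_" + key));
                    writers.put(key, writer);
                }
                writer.write(line);
                writer.newLine();
            }
        } finally {
            for (BufferedWriter writer : writers.values()) {
                writer.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        partition("file_1", "file_1");   // produces file_1_mat1, file_1_mat2, ...
        partition("file_2", "file_2");
    }
}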

That's one pass through two files for a total of 80 million lines read, yielding two sets of 50 files of 800,000 lines each on average.

Sorting: For each partition, sort according to the numeric value in the second column only (the lower bound from file_1 and the actual number from file_2). Even if 800,000 lines can't fit into memory, I suppose we can adapt 2-way external merge sort and perform this faster (fewer overall reads) than a sort of the entire unpartitioned space.

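Assuming an average partition of 800,000 lines does fit into memory (if it doesn't, an external merge sort would be needed instead, as the answer notes), the per-partition sort could be sketched like this; the file names and the use of Long for the second column are my assumptions.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;

public class PartitionSorter {

    // Sorts one partition file in memory by the numeric value in its second column
    // (the lower bound in file_1_matX, the single value in file_2_matX).
    public static void sortPartition(String fileName) throws IOException {
        Path path = Paths.get(fileName);
        List<String> lines = Files.readAllLines(path);
        lines.sort(Comparator.comparingLong(line -> Long.parseLong(line.split("\\s+")[1])));
        Files.write(path, lines);                      // overwrite the file with its sorted lines
    }

    public static void main(String[] args) throws IOException {
        for (int i = 1; i <= 50; i++) {
            for (String name : new String[] { "file_1_mat" + i, "file_2_mat" + i }) {
                if (Files.exists(Paths.get(name))) {   // a key may not occur in both files
                    sortPartition(name);
                }
            }
        }
    }
}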

Comparison: Now you just have to iterate once through each pair, file_1_mat1 against file_2_mat1 and so on, without needing to keep anything in memory, outputting matches to your output file. Repeat for the rest of the partitions in turn. No need for a final 'merge' step (unless you're processing partitions in parallel).

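A hedged sketch of that single pass follows. It assumes both files of a pair have been sorted as above and that the ranges within one partition do not overlap (overlapping ranges would need a small buffer of "active" ranges rather than a single current one); the matching rule comes from the question, and the names are mine.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class PartitionComparer {

    // Single pass over one sorted pair (file_1_matX against file_2_matX).
    public static void compare(String rangesFile, String valuesFile, PrintWriter out) throws IOException {
        try (BufferedReader ranges = new BufferedReader(new FileReader(rangesFile));
             BufferedReader values = new BufferedReader(new FileReader(valuesFile))) {

            String rangeLine = ranges.readLine();
            String valueLine = values.readLine();

            while (rangeLine != null && valueLine != null) {
                String[] r = rangeLine.split("\\s+");
                long lower = Long.parseLong(r[1]);
                long upper = Long.parseLong(r[2]);
                long value = Long.parseLong(valueLine.split("\\s+")[1]);

                if (value < lower) {
                    valueLine = values.readLine();      // value lies before the current range
                } else if (value > upper) {
                    rangeLine = ranges.readLine();      // current range can never match again
                } else {
                    out.println(rangeLine);             // match: value falls inside the range
                    out.println(valueLine);
                    valueLine = values.readLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter("matches.txt"))) {
            for (int i = 1; i <= 50; i++) {
                String f1 = "file_1_mat" + i;
                String f2 = "file_2_mat" + i;
                if (new File(f1).exists() && new File(f2).exists()) {
                    compare(f1, f2, out);
                }
            }
        }
    }
}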

Even without the sorting stage the naive comparison you're already doing should work faster across 50 pairs of files with 800,000 lines each rather than with two files with 40 million lines each.

Answer by epochengine

In an ideal world, you would be able to read in every line of file_2 into memory (probably using a fast lookup object like a HashSet, depending on your needs), then read in each line from file_1 one at a time and compare it to your data structure holding the lines from file_2.

As you have said that you run out of memory, however, I think a divide-and-conquer type strategy would be best. You could use the same method as I mentioned above, but read in a half (or a third, a quarter... depending on how much memory you can use) of the lines from file_2 and store them, then compare all of the lines in file_1. Then read the next half/third/quarter/whatever into memory (replacing the old lines) and go through file_1 again. It means you have to go through file_1 more times, but you have to work within your memory constraints.

EDIT: In response to the added detail in your question, I would change my answer in part. Instead of reading in all of file_2 (or in chunks) and reading in file_1 a line at a time, reverse that, as file_1 holds the data to check against.

Also, with regard to searching for the matching lines, I think the best way would be to do some processing on file_1. Create a HashMap<String, List<Range>> that maps a String ("mat1" - "mat50") to a list of Ranges (just a wrapper for a startOfRange int and an endOfRange int) and populate it with the data from file_1. Then write a function like (ignoring error checking)

boolean isInRange(String material, int value)
{
    List<Range> ranges = hashMapName.get(material);
    for (Range range : ranges)
    {
        if (value >= range.getStart() && value <= range.getEnd())
        {
            return true;
        }
    }
    return false;
}

and call it for each (parsed) line of file_2.

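To make that concrete, here is one possible sketch of the Range wrapper, the map construction from file_1, and the scan over file_2. The line format (matX, one or two numbers, then text) is taken from the question; the class, method, and file names are mine, and, as in the answer, all of file_1's ranges are held in memory.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RangeIndex {

    static class Range {
        private final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
        int getStart() { return start; }
        int getEnd()   { return end; }
    }

    // "mat1".."mat50" -> all ranges seen for that key in file_1
    static Map<String, List<Range>> buildIndex(String file1) throws IOException {
        Map<String, List<Range>> index = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(file1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\\s+");        // matX lower upper TEXT...
                index.computeIfAbsent(parts[0], k -> new ArrayList<>())
                     .add(new Range(Integer.parseInt(parts[1]), Integer.parseInt(parts[2])));
            }
        }
        return index;
    }

    static boolean isInRange(Map<String, List<Range>> index, String material, int value) {
        for (Range range : index.getOrDefault(material, Collections.emptyList())) {
            if (value >= range.getStart() && value <= range.getEnd()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<Range>> index = buildIndex("file_1");
        try (BufferedReader reader = new BufferedReader(new FileReader("file_2"));
             PrintWriter out = new PrintWriter(new FileWriter("matches.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\\s+");        // matX value TEXT...
                if (isInRange(index, parts[0], Integer.parseInt(parts[1]))) {
                    out.println(line);                      // keep the matching file_2 line
                }
            }
        }
    }
}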

Answer by BegemoT

I think your way is rather reasonable.

I can imagine different strategies -- for example, you could sort both files before comparing (there are efficient implementations of file sort, and the unix sort utility can sort files of several GBs in minutes), and, once they are sorted, you can compare the files sequentially, reading them line by line.

But this is a rather complex way to go -- you need to run an external program (sort), or write a comparably efficient implementation of file sort in Java yourself -- which is by itself not an easy task. So, for the sake of simplicity, I think your way of chunked reads is very promising.

As for how to find a reasonable block size -- first of all, "the more, the better" may not be correct -- I think the total running time will grow asymptotically towards some constant line. So you may get close to that line faster than you think -- you'll need a benchmark for this.

Next -- you may read lines into a buffer like this:

final List<String> lines = new ArrayList<>();
// assuming the file is read through a BufferedReader (java.io); "file_1" is an example name
final BufferedReader reader = new BufferedReader(new FileReader("file_1"));
try {
    final List<String> block = new ArrayList<>(BLOCK_SIZE);
    for (int i = 0; i < BLOCK_SIZE; i++) {
        final String line = reader.readLine();   // read one line from the file
        if (line == null) {
            break;                               // end of file reached
        }
        block.add(line);
    }
    lines.addAll(block);
} catch (OutOfMemoryError ooe) {                 // OutOfMemoryError is the actual class name
    // stop reading; work with the lines collected so far
}

So you read as many lines as you can -- leaving the last BLOCK_SIZE of free memory. BLOCK_SIZE should be big enough for the rest of your program to run without an OOM.

Answer by Mariy

If you want to know exactly if the files are different or not then there isn't a better solution than yours -- comparing sequentially.

However, you can use some heuristics that can tell you, with some probability, whether the files are identical. 1) Check the file size; that's the easiest. 2) Take a random file position and compare the block of bytes starting at this position in the two files. 3) Repeat step 2) to achieve the needed probability.

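A possible sketch of that probabilistic check using RandomAccessFile; the block size, the number of samples, and treating a size mismatch as "different" are choices made for this illustration, and a result of true only means "no difference found in the sampled blocks".

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.Random;

public class ProbablyEqual {

    static boolean probablyEqual(String fileA, String fileB, int samples, int blockSize) throws IOException {
        try (RandomAccessFile a = new RandomAccessFile(fileA, "r");
             RandomAccessFile b = new RandomAccessFile(fileB, "r")) {
            if (a.length() != b.length()) {
                return false;                         // step 1: sizes must match
            }
            long length = a.length();
            int size = (int) Math.min(blockSize, length);
            byte[] bufA = new byte[size];
            byte[] bufB = new byte[size];
            Random random = new Random();
            for (int i = 0; i < samples; i++) {       // steps 2 and 3: sample random positions
                long pos = (long) (random.nextDouble() * (length - size + 1));
                a.seek(pos);
                b.seek(pos);
                a.readFully(bufA);
                b.readFully(bufB);
                if (!Arrays.equals(bufA, bufB)) {
                    return false;                     // a sampled block differs
                }
            }
            return true;                              // no difference found in the samples
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(probablyEqual("file_1", "file_2", 100, 64 * 1024));
    }
}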

You should compute and test how many reads (and what block size) are useful for your program.

Answer by Mike Houston

My solution would be to produce an index of one file first, then use that to do the comparison. This is similar to some of the other answers in that it uses hashing.

You mention that the number of lines is up to about 45 million. This means that you could (potentially) store an index which uses 16 bytes per entry (128 bits) and it would use about 45,000,000*16 = ~685MB of RAM, which isn't unreasonable on a modern system. There are overheads in using the solution I describe below, so you might still find you need to use other techniques such as memory mapped files or disk based tables to create the index. See Hypertable or HBase for an example of how to store the index in a fast disk-based hash table.

So, in full, the algorithm would be something like:

  1. Create a hash map which maps Long to a List of Longs (HashMap<Long, List<Long>>)
  2. Get the hash of each line in the first file (Object.hashCode should be sufficient)
  3. Get the offset in the file of the line so you can find it again later
  4. Add the offset to the list of lines with matching hashCodes in the hash map
  5. Compare each line of the second file to the set of line offsets in the index
  6. Keep any lines which have matching entries

EDIT: In response to your edited question, this wouldn't really help in itself. You could just hash the first part of the line, but it would only create 50 different entries. You could then create another level in the data structure though, which would map the start of each range to the offset of the line it came from.

So something like index.get("mat32") would return a TreeMap of ranges. You could look for the range preceding the value you are looking for with lowerEntry(). Together this would give you a pretty fast check to see if a given matX/number combination is in one of the ranges you are checking for.

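A minimal sketch of that two-level structure. Here floorEntry() is used instead of lowerEntry() so that a value equal to a range's start still matches; and if several ranges for one matX share a start or overlap, a single lookup can miss some of them -- those caveats are mine, not the answer's.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class RangeTreeIndex {

    // matX -> (range start -> range end), built once from file_1
    private final Map<String, TreeMap<Integer, Integer>> index = new HashMap<>();

    // Call once per file_1 line, e.g. addRange("mat1", 1000, 2000)
    public void addRange(String material, int start, int end) {
        index.computeIfAbsent(material, k -> new TreeMap<>()).put(start, end);
    }

    // Call once per file_2 line, e.g. contains("mat3", 10009)
    public boolean contains(String material, int value) {
        TreeMap<Integer, Integer> ranges = index.get(material);
        if (ranges == null) {
            return false;
        }
        Map.Entry<Integer, Integer> candidate = ranges.floorEntry(value);   // nearest start <= value
        return candidate != null && value <= candidate.getValue();          // inside that range?
    }
}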

Answer by Ingo

Indeed, that could take a while. You have to make 1,200,000,000 line comparisons. There are several possibilities to speed that up by an order of magnitude:

One would be to sort file2 and do a kind of binary search at the file level. Another approach: compute a checksum of each line, and search that. Depending on the average line length, the file in question would be much smaller, and you really can do a binary search if you store the checksums in a fixed format (i.e. a long).

The number of lines you read at once from file_1 does not matter, however. This is micro-optimization in the face of great complexity.

Answer by duedl0r

If you want a simple approach: you can hash both of the files and compare the hashes. But it's probably faster (especially if the files differ) to use your approach. About the memory consumption: just make sure you use enough memory; using no buffer for this kind of thing is a bad idea.

And all those answers about hashes, checksums etc: those are not faster. You have to read the whole file in both cases. With hashes/checksums you even have to compute something...

Answer by Peter Lawrey

What you can do is sort each individual file, e.g. with the UNIX sort or similar in Java. You can then read the sorted files one line at a time to perform a merge sort.

Answer by amit

There is a tradeoff: if you read a big chunk of the file, you save disk seek time, but you may have read information you will not need, since a difference may already be encountered in the first lines.

You should probably run some experiments [benchmarks] with varying chunk sizes to find out the optimal chunk to read in the average case.

Answer by sealz

I have never worked with such huge files, but this is my idea and it should work.

You could look into hashing, for example SHA-1 hashing.

Import the following

import java.io.FileInputStream;
import java.security.MessageDigest;

Once your text file has been loaded, have it loop through each line and at the end print out the hash. The example links below go into more depth.

// mdbytes is the digest produced by MessageDigest.getInstance("SHA-1") after
// feeding it the file's bytes, e.g. byte[] mdbytes = md.digest();
StringBuffer myBuffer = new StringBuffer("");
// convert each digest byte into two hex characters
for (int i = 0; i < mdbytes.length; i++) {
    myBuffer.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
}
System.out.println("Computed Hash = " + myBuffer.toString());
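
Pieced together, a complete streaming version of that idea might look like the following sketch; the class and method names are mine, and SHA-1 via MessageDigest is used as in the answer.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileHasher {

    // Streams the whole file through SHA-1 and returns the digest as a hex string.
    public static String sha1Of(String fileName) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        try (InputStream in = new FileInputStream(fileName)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);           // feed the file through the digest
            }
        }
        byte[] mdbytes = md.digest();
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < mdbytes.length; i++) {
            hex.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String h1 = sha1Of("file_1");
        String h2 = sha1Of("file_2");
        System.out.println("Computed Hash 1 = " + h1);
        System.out.println("Computed Hash 2 = " + h2);
        System.out.println(h1.equals(h2) ? "Files are identical" : "Files differ");
    }
}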

SHA Code example focusing on Text File

SO Question about computing SHA in JAVA (Possibly helpful)

Another sample of hashing code.

Simply read each file separately; if the hash values for the files are the same at the end of the process, then the two files are identical. If not, then something is different.

Then, if you get different values, you can do the super time-consuming line-by-line check.

Overall, it seems that reading line by line by line, etc., would take forever. I would do this if you are trying to find each individual difference. But I think hashing would be quicker if you just want to see whether they are the same.

SHA checksum
