Deleting duplicate lines in a file using Java

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/996041/

Deleting duplicate lines in a file using Java

Tags: java, file, text, file-io, duplicates

Asked by Monster

As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem is that my generated file is pretty big and text-heavy (about 225k lines of text, and around 40 megs). I estimate my current process to take 63 hours! This is definitely not acceptable.

I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!

Accepted answer by Michael Myers

Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.

It would be something like this (ignoring exceptions):

public void stripDuplicatesFromFile(String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line); // Set.add() silently ignores duplicates
    }
    reader.close();
    // overwrite the original file with only the unique lines
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}

If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.

Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of a LinkedHashSet. But I doubt you'd notice the difference on an I/O-bound operation like this one.

Answer by brabster

You could use a Set from the Collections library to store the unique values you have already seen as you read the file.

Set<String> uniqueStrings = new HashSet<String>();

// read your file, looping on newline, putting each line into variable 'thisLine'

    uniqueStrings.add(thisLine);

// finish read

for (String uniqueString : uniqueStrings) {
  // do your processing for each unique String
  // i.e. System.out.println(uniqueString);
}

Answer by Kevin Dungs

Try a simple HashSet that stores the lines you have already read. Then iterate over the file. If you come across duplicates, they are simply ignored (as a Set can contain each element only once).

Answer by gustafc

Something like this, perhaps:

BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<String>();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
    out.println(line);

LinkedHashSet keeps the insertion order, as opposed to HashSet, which (while being slightly faster for lookup/insert) will reorder all lines.

Answer by fortran

The Hash Set approach is OK, but you can tweak it so that it doesn't have to store all the Strings in memory: keep only a logical pointer to each line's location in the file, so you can go back and read the actual value only when you need it.

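For illustration, here is a rough, hypothetical sketch of that pointer idea (not from the original answer): it keeps only each line's hash code and byte offset, and goes back to the file to compare the real text only when two hashes collide. The class name, the output-to-a-second-file choice, and the use of RandomAccessFile (whose readLine() is unbuffered and assumes a Latin-1-style encoding) are all assumptions of this sketch.

import java.io.*;
import java.util.*;

// Hypothetical sketch: store hash codes plus file offsets instead of the lines themselves.
public class OffsetDedup {
    public static void stripDuplicates(File in, File out) throws IOException {
        Map<Integer, List<Long>> offsetsByHash = new HashMap<Integer, List<Long>>();
        RandomAccessFile raf = new RandomAccessFile(in, "r");
        BufferedWriter writer = new BufferedWriter(new FileWriter(out));
        long lineStart = raf.getFilePointer();
        String line;
        while ((line = raf.readLine()) != null) {
            boolean duplicate = false;
            List<Long> candidates = offsetsByHash.get(line.hashCode());
            if (candidates == null) {
                candidates = new ArrayList<Long>();
                offsetsByHash.put(line.hashCode(), candidates);
            } else {
                long resume = raf.getFilePointer();
                for (long offset : candidates) {   // same hash: confirm against the file
                    raf.seek(offset);
                    if (line.equals(raf.readLine())) { duplicate = true; break; }
                }
                raf.seek(resume);                  // continue where we left off
            }
            if (!duplicate) {
                candidates.add(lineStart);
                writer.write(line);
                writer.newLine();
            }
            lineStart = raf.getFilePointer();
        }
        raf.close();
        writer.close();
    }
}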

Another creative approach is to append the line number to each line, then sort all the lines, remove the duplicates (ignoring the last token, which should be the number), then sort the file again by that last token and strip it out in the output.

Answer by samoz

If you could use UNIX shell commands you could do something like the following:

for(i = line 0 to end)
{
    sed "s/$i//2g"   # deletes all repeats
}

This would iterate through your whole file and only pass each unique occurrence once per sed call. This way you're not doing a bunch of searches you've done before.

Answer by Simon Nickerson

  • Read in the file, storing the line number and the line: O(n)
  • Sort it into alphabetical order: O(n log n)
  • Remove duplicates: O(n)
  • Sort it into its original line number order: O(n log n)
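A minimal sketch of those four steps, assuming the whole file still fits in memory for the in-memory sort; the class and method names, the String[] pairing of line and line number, and the separate output file are assumptions of this sketch, not part of the answer.

import java.io.*;
import java.util.*;

public class SortDedup {
    public static void dedupePreservingOrder(String inFile, String outFile) throws IOException {
        // 1. Read the file, storing the line number with each line: O(n)
        List<String[]> entries = new ArrayList<String[]>();   // each entry is {line, lineNumber}
        BufferedReader reader = new BufferedReader(new FileReader(inFile));
        String line;
        for (int no = 0; (line = reader.readLine()) != null; no++) {
            entries.add(new String[] { line, String.valueOf(no) });
        }
        reader.close();

        // 2. Sort into alphabetical order: O(n log n)
        Collections.sort(entries, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) { return a[0].compareTo(b[0]); }
        });

        // 3. Remove duplicates: O(n). The sort is stable, so the first entry of each
        //    run of equal lines carries the earliest original line number.
        List<String[]> unique = new ArrayList<String[]>();
        for (String[] e : entries) {
            if (unique.isEmpty() || !unique.get(unique.size() - 1)[0].equals(e[0])) {
                unique.add(e);
            }
        }

        // 4. Sort back into the original line-number order: O(n log n)
        Collections.sort(unique, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                return Integer.valueOf(a[1]).compareTo(Integer.valueOf(b[1]));
            }
        });

        BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));
        for (String[] e : unique) {
            writer.write(e[0]);
            writer.newLine();
        }
        writer.close();
    }
}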

Answer by user44242

There are two scalable solutions, where by scalable I mean disk-based rather than memory-based, depending on whether the procedure needs to be stable or not, where by stable I mean that the order after removing duplicates is the same. If scalability isn't an issue, then simply use memory for the same sort of method.

For the non-stable solution, first sort the file on disk. This is done by splitting the file into smaller files, sorting the smaller chunks in memory, and then merging the files in sorted order, where the merge ignores duplicates.

The merge itself can be done using almost no memory, by comparing only the current line in each file, since the next line is guaranteed to be greater.

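As a hedged illustration of that merge step, here is a sketch for just two already-sorted chunk files; the class and file names and the two-file simplification are my assumptions (a real external sort would merge k chunks, for example with a priority queue).

import java.io.*;

public class SortedMerge {
    // Merge two sorted chunk files into one sorted, duplicate-free output,
    // holding only the current line of each input in memory.
    public static void mergeDropDuplicates(String aFile, String bFile, String outFile) throws IOException {
        BufferedReader a = new BufferedReader(new FileReader(aFile));
        BufferedReader b = new BufferedReader(new FileReader(bFile));
        BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
        String lineA = a.readLine();
        String lineB = b.readLine();
        String lastWritten = null;
        while (lineA != null || lineB != null) {
            String next;
            if (lineB == null || (lineA != null && lineA.compareTo(lineB) <= 0)) {
                next = lineA;
                lineA = a.readLine();
            } else {
                next = lineB;
                lineB = b.readLine();
            }
            if (!next.equals(lastWritten)) {   // equal lines arrive back to back,
                out.write(next);               // so remembering one line is enough
                out.newLine();
                lastWritten = next;
            }
        }
        a.close();
        b.close();
        out.close();
    }
}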

The stable solution is slightly trickier. First, sort the file in chunks as before, but indicate in each line the original line number. Then, during the "merge" don't bother storing the result, just the line numbers to be deleted.

Then copy the original file line by line, ignoring the line numbers you have stored above.

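And a small sketch of that final copy pass, assuming the merge phase produced a Set<Long> of line numbers to drop; the class name and method signature are assumptions of this sketch.

import java.io.*;
import java.util.Set;

public class SkipCopy {
    // Copy the original file line by line, skipping the line numbers
    // that the merge phase marked as duplicates.
    public static void copySkippingLines(String inFile, String outFile, Set<Long> linesToDelete) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(inFile));
        BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));
        String line;
        long lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            if (!linesToDelete.contains(lineNumber)) {
                writer.write(line);
                writer.newLine();
            }
            lineNumber++;
        }
        reader.close();
        writer.close();
    }
}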

Answer by phihag

If the order does not matter, the simplest way is shell scripting:

<infile sort | uniq > outfile

Answer by mikek

Does it matter in which order the lines come, and how many duplicates are you counting on seeing?

If not, and if you're counting on a lot of dupes (i.e. a lot more reading than writing), I'd also think about parallelizing the hashset solution, with the hashset as a shared resource.

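For what it's worth, a hedged sketch of that parallel idea using facilities from later Java versions (Java 8+ streams and ConcurrentHashMap.newKeySet()); the file names are placeholders, and while exactly one copy of each duplicate survives, which copy wins is nondeterministic, so the original first-occurrence order is not guaranteed.

import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelDedup {
    public static void main(String[] args) throws IOException {
        // The concurrent set is the shared resource; add() returns false for duplicates.
        Set<String> seen = ConcurrentHashMap.newKeySet();
        try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
            List<String> unique = lines.parallel()
                    .filter(seen::add)        // keep only lines not seen before
                    .collect(Collectors.toList());
            Files.write(Paths.get("output.txt"), unique);
        }
    }
}

Whether this actually beats the single-threaded HashSet version depends on how I/O-bound the job is, as the other answers point out.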