Reading large file in Java -- Java heap space

Note: This content is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/5890616/
Asked by user431336
I'm reading a large tsv file (~40G) and trying to prune it by reading line by line and print only certain lines to a new file. However, I keep getting the following exception:
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2894)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
    at java.lang.StringBuffer.append(StringBuffer.java:323)
    at java.io.BufferedReader.readLine(BufferedReader.java:362)
    at java.io.BufferedReader.readLine(BufferedReader.java:379)
Below is the main part of the code. I specified the buffer size to be 8192 just in case. Doesn't Java clear the buffer once the buffer size limit is reached? I don't see what may cause the large memory usage here. I tried to increase the heap size but it didn't make any difference (machine with 4GB RAM). I also tried flushing the output file every X lines but it didn't help either. I'm thinking maybe I need to make calls to the GC but it doesn't sound right.
Any thoughts? Thanks a lot. BTW - I know I should call trim() only once, store it, and then use it.
Set<String> set = new HashSet<String>();
set.add("A-B");
...
...

static public void main(String[] args) throws Exception
{
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(inputFile), "UTF-8"), 8192);
    PrintStream output = new PrintStream(outputFile, "UTF-8");
    String line = reader.readLine();
    while (line != null) {
        String[] fields = line.split("\t");
        if (set.contains(fields[0].trim() + "-" + fields[1].trim()))
            output.println(fields[0].trim() + "-" + fields[1].trim());
        line = reader.readLine();
    }
    output.close();
}
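For reference, the trim() cleanup the asker mentions would look something like this inside the loop (a minor tidy-up; it is unrelated to the OOME):

String key = fields[0].trim() + "-" + fields[1].trim();  // compute once, reuse
if (set.contains(key))
    output.println(key);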
Answered by toadaly
Most likely, what's going on is that the file does not have line terminators, and so the reader just keeps growing its StringBuffer without bound until it runs out of memory.
The solution would be to read a fixed number of characters at a time, using the 'read' method of the reader, and then look for newlines (or other parsing tokens) within the smaller buffer(s).
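A minimal sketch of this chunked approach, reading fixed-size character buffers and splitting on newlines by hand. The class name, handleLine() placeholder, and the one-million-character cap are all illustrative assumptions, not part of the original answer:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class ChunkedScan {
    private static final int MAX_LINE = 1000000;  // arbitrary safety cap

    public static void main(String[] args) throws Exception {
        char[] buffer = new char[8192];           // fixed-size chunk
        StringBuilder current = new StringBuilder();
        Reader reader = new InputStreamReader(new FileInputStream(args[0]), "UTF-8");
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = buffer[i];
                    if (c == '\n' || c == '\r') {
                        if (current.length() > 0) {
                            handleLine(current.toString());
                            current.setLength(0); // reuse the builder
                        }
                    } else if (current.length() < MAX_LINE) {
                        current.append(c);        // ignore chars beyond the cap
                    }
                }
            }
            if (current.length() > 0) {
                handleLine(current.toString());   // final line with no terminator
            }
        } finally {
            reader.close();
        }
    }

    // hypothetical placeholder for per-line processing
    private static void handleLine(String line) {
        System.out.println(line);
    }
}

Because the StringBuilder is reset (or capped) per line, no single terminator-free "line" can grow without bound the way it can inside BufferedReader.readLine().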
Answered by Steve Emmerson
Are you certain the "lines" in the file are separated by newlines?
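One quick way to check, if you cannot inspect the file with other tools, is to scan its first chunk for CR/LF bytes before trusting readLine() to ever terminate. This is a throwaway sketch with a made-up class name, not something from the original answer:

import java.io.FileInputStream;
import java.io.InputStream;

public class NewlineCheck {
    public static void main(String[] args) throws Exception {
        InputStream in = new FileInputStream(args[0]);
        try {
            byte[] buf = new byte[65536];         // inspect only the first 64 KB
            int n = in.read(buf);
            boolean found = false;
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n' || buf[i] == '\r') {
                    found = true;
                    break;
                }
            }
            System.out.println(found
                    ? "line terminators present in the first 64 KB"
                    : "NO line terminators in the first 64 KB");
        } finally {
            in.close();
        }
    }
}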
Answered by Stephen C
I have 3 theories:
1. The input file is not UTF-8 but some indeterminate binary format that results in extremely long lines when read as UTF-8.
2. The file contains some extremely long "lines" ... or no line breaks at all.
3. Something else is happening in code that you are not showing us; e.g. you are adding new elements to set.
To help diagnose this:
- Use some tool like od (on UNIX / LINUX) to confirm that the input file really contains valid line terminators; i.e. CR, NL, or CR NL.
- Use some tool to check that the file is valid UTF-8.
- Add a static line counter to your code, and when the application blows up with an OOME, print out the value of the line counter.
- Keep track of the longest line seen so far, and print that out as well when you get an OOME (see the sketch after this list).
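A sketch of the last two diagnostics combined (class name is illustrative). The counters are printed in a finally block, so if readLine() triggers the OOME you still see how far the reader got and the longest line completed up to that point:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class LineDiagnostics {
    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "UTF-8"));
        long lineCount = 0;
        long longestLine = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lineCount++;
                if (line.length() > longestLine) {
                    longestLine = line.length();
                }
            }
        } finally {
            // If readLine() dies with an OOME, this still reports how far we got
            // and the longest line completed so far.
            System.err.println("lines read: " + lineCount
                    + ", longest completed line: " + longestLine + " chars");
            reader.close();
        }
    }
}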
For the record, your slightly suboptimal use of trim will have no bearing on this issue.
Answered by Nathan Ryan
One possibility is that you are running out of heap space during a garbage collection. The Hotspot JVM uses a parallel collector by default, which means that your application can possibly allocate objects faster than the collector can reclaim them. I have been able to cause an OutOfMemoryError with supposedly only 10K live (small) objects, by rapidly allocating and discarding.
You can try instead using the old (pre-1.5) serial collector with the option -XX:+UseSerialGC. There are several other "extended" options that you can use to tune collection.
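For example, assuming the pruning program above is compiled as a class named Pruner (a made-up name), the serial collector would be selected at launch like this:

java -Xmx1g -XX:+UseSerialGC Pruner

(-Xmx sets the maximum heap size; both are standard HotSpot options.)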
Answered by Shaunak
You might want to try moving the String[] fields declaration out of the loop, as you are creating a new array on every iteration. You can just reuse the old one, right?
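A sketch of the suggested change, for what it's worth (note that split() still returns a fresh array on each call, so this only hoists the reference declaration):

String[] fields;                          // declared once, outside the loop
String line = reader.readLine();
while (line != null) {
    fields = line.split("\t");            // split() still allocates a new array
    if (set.contains(fields[0].trim() + "-" + fields[1].trim()))
        output.println(fields[0].trim() + "-" + fields[1].trim());
    line = reader.readLine();
}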