Read large files in Java

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/2356137/

Read large files in Java

java, memory-management, file

Asked by CC.

I need advice from someone who knows Java very well and knows about memory issues. I have a large file (about 1.5GB) and I need to cut this file into many smaller files (100 small files, for example).

I know generally how to do it (using a BufferedReader), but I would like to know if you have any advice regarding memory, or tips on how to do it faster.

My file contains text, it is not binary, and I have about 20 characters per line.

Accepted answer by Michael Borgwardt

First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along line breaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).

Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).
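
A minimal sketch of what that looks like for line-oriented text (the 1MB buffer size follows the suggestion above; file names are illustrative):

import java.io.*;

int bufferSize = 1024 * 1024; // 1MB stream buffer, so the disk mostly does sequential I/O
try (BufferedReader reader = new BufferedReader(new FileReader("bigfile.txt"), bufferSize);
     BufferedWriter writer = new BufferedWriter(new FileWriter("part1.txt"), bufferSize)) {
    String line;
    // readLine() strips the line terminator, so write one back explicitly
    while ((line = reader.readLine()) != null) {
        writer.write(line);
        writer.newLine();
    }
}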

If speed turns out to be a problem, you could have a look at the java.nio packages; those are supposedly faster than java.io.

Answer by Kartoch

You can use java.nio, which is faster than the classical input/output streams:

http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
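
For instance, a minimal sketch of a file copy using FileChannel.transferTo from java.nio (file names are illustrative; transferTo can use zero-copy transfers on many platforms):

import java.io.*;
import java.nio.channels.FileChannel;

try (FileChannel in = new FileInputStream("bigfile.txt").getChannel();
     FileChannel out = new FileOutputStream("copy.txt").getChannel()) {
    long position = 0;
    long size = in.size();
    // transferTo may move fewer bytes than requested, so loop until everything is copied
    while (position < size) {
        position += in.transferTo(position, size - position, out);
    }
}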

Answer by BalusC

To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign it to variables outside the loop). Just process the output immediately as soon as the input comes in.

It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest. At worst it will only shave a few percent off performance. The same applies to using NIO. It will only improve scalability, not memory use. It will only become interesting when you have hundreds of threads running on the same file.

Just loop through the file, writing every line immediately to another file as you read it in; count the lines, and when the count reaches 100, switch to the next file, et cetera.

Kickoff example:

import java.io.*;

String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;

try {
    reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
    int count = 0;
    for (String line; (line = reader.readLine()) != null;) {
        // Every maxlines lines, close the current part and start the next one.
        if (count++ % maxlines == 0) {
            close(writer);
            writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
        }
        writer.write(line);
        writer.newLine();
    }
} finally {
    close(writer);
    close(reader);
}

// Null-safe close utility used above.
static void close(Closeable resource) {
    if (resource != null) {
        try {
            resource.close();
        } catch (IOException ignore) {
            // Closing failed; nothing sensible to do about it here.
        }
    }
}

Answer by oneat

Don't use read() without arguments; it's very slow. It's better to read into a buffer and move the data to a file quickly.

Use BufferedInputStream because it supports binary reading.
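
In other words, something like this sketch, which reads chunks into a byte array rather than one byte per read() call (the 8KB buffer size is an arbitrary choice):

import java.io.*;

try (InputStream in = new BufferedInputStream(new FileInputStream("bigfile.bin"));
     OutputStream out = new BufferedOutputStream(new FileOutputStream("out.bin"))) {
    byte[] buffer = new byte[8192];
    int bytesRead;
    // read(byte[]) fills up to buffer.length bytes per call and returns -1 at end of file
    while ((bytesRead = in.read(buffer)) != -1) {
        out.write(buffer, 0, bytesRead);
    }
}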

And that's all.

Answer by b.roth

This is a very good article: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/

In summary, for great performance, you should:

  1. Avoid accessing the disk.
  2. Avoid accessing the underlying operating system.
  3. Avoid method calls.
  4. Avoid processing bytes and characters individually.

For example, to reduce disk access, you can use a large buffer. The article describes various approaches.

Answer by Ryan Emerle

You can consider using memory-mapped files, via FileChannel.

Generally a lot faster for large files. There are performance trade-offs that could make it slower, so YMMV.
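
A minimal sketch of scanning a file through a memory mapping (the file name is illustrative; note that a single mapping is limited to 2GB, so larger files must be mapped in chunks):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

try (RandomAccessFile raf = new RandomAccessFile("bigfile.txt", "r");
     FileChannel channel = raf.getChannel()) {
    // Map the whole file read-only; the OS pages it in on demand
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    while (buffer.hasRemaining()) {
        byte b = buffer.get();
        // process b ...
    }
}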

Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness

Answer by Mike

Does it have to be done in Java? I.e. does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted, you could execute this command via your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
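
If you take that route, invoking split from Java might look like this sketch (assuming a *nix system with split on the PATH; -l 100 writes 100 lines per output file, and "smallfile-" is an illustrative output prefix):

ProcessBuilder pb = new ProcessBuilder("split", "-l", "100", "bigfile.txt", "smallfile-");
pb.inheritIO(); // forward split's stdout/stderr to this process's console
Process process = pb.start(); // throws IOException if split cannot be started
int exitCode = process.waitFor(); // 0 indicates success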

Answer by Thorbjørn Ravn Andersen

Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to try starting with a file containing 100 lines, writing it to 100 different files (one line in each), and making the triggering mechanism work on the number of lines written to the current file. That program will be easily scalable to your situation.

Answer by Namalak

Yes. I also think that using read() with arguments like read(char[], int offset, int length) is a better way to read such a large file (e.g. read(buffer, 0, buffer.length)).

And I also experienced the problem of missing values when using a BufferedReader instead of a BufferedInputStream for a binary data input stream. So, using a BufferedInputStream is much better in a case like this.

Answer by RAM

package all.is.well;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import junit.framework.TestCase;

/**
 * @author Naresh Bhabat
 *
 * The following implementation helps to deal with extra large files in Java.
 * This program was tested with a 2GB input file. There are some points where
 * extra logic can be added in the future.
 *
 * Please note: if we want to deal with a binary input file, then instead of
 * reading lines we need to read bytes from the file object.
 *
 * It uses RandomAccessFile, which is almost like a streaming API.
 *
 * ****************************************
 * Notes regarding the executor framework and its timings, for
 * ExecutorService executor = Executors.newFixedThreadPool(n):
 *
 *   n = 10:    total time for reading and writing the text: 349.317 seconds
 *   n = 100:   total time for reading and writing the text: 464.042 seconds
 *   n = 1000:  total time for reading and writing the text: 466.538 seconds
 *   n = 10000: total time for reading and writing the text: 479.701 seconds
 */
public class DealWithHugeRecordsinFile extends TestCase {

    static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
    static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
    static volatile RandomAccessFile fileToWrite;
    static volatile RandomAccessFile file;
    static volatile int position = 0;

    public static void main(String[] args) throws IOException, InterruptedException {
        long start = System.currentTimeMillis();

        try {
            fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw"); // for random writes, independent of thread obstacles
            file = new RandomAccessFile(FILEPATH, "r"); // for random reads, independent of thread obstacles
            seriouslyReadProcessAndWriteAsynch();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(Thread.currentThread().getName());
        double timeSeconds = (System.currentTimeMillis() - start) / 1000.0;
        System.out.println("Total time required for reading the text in seconds " + timeSeconds);
    }

    /**
     * Reads the input file line by line and hands each line off to a pool of
     * worker threads for processing and writing.
     */
    public static void seriouslyReadProcessAndWriteAsynch() throws IOException, InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(10); // see the timings in the class comment
        while (true) {
            final String readLine = file.readLine();
            if (readLine == null) {
                break;
            }
            Runnable genuineWorker = new Runnable() {
                @Override
                public void run() {
                    // Do the hard processing here in this thread; the write method
                    // below consumes some time and deliberately swallows an exception.
                    writeToFile(FILEPATH_WRITE, readLine);
                }
            };
            executor.execute(genuineWorker);
        }
        executor.shutdown();
        // Block until all submitted tasks have finished (instead of busy-waiting).
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        System.out.println("Finished all threads");
        file.close();
        fileToWrite.close();
    }

    /**
     * @param filePath target path (kept from the original signature; writes go through the shared fileToWrite)
     * @param data     the line to process and write
     */
    private static void writeToFile(String filePath, String data) {
        try {
            data = "\n" + data;
            if (!data.contains("Randomization")) {
                return; // only lines containing this marker are written
            }
            // Note: position++ on a volatile int is not atomic; it is used here
            // only as a rough progress counter. Writes to the shared
            // RandomAccessFile are likewise not synchronized.
            System.out.println("Let us do something time consuming to make this thread busy " + (position++) + " : " + data);
            System.out.println("Let's consume time through this loop");
            int i = 1000;
            while (i > 0) {
                i--;
            }
            fileToWrite.write(data.getBytes());
            throw new Exception(); // deliberately thrown to demonstrate failure handling
        } catch (Exception exception) {
            System.out.println("An exception was thrown but we can still proceed further."
                    + "\nThis can be used for marking failure of the records");
        }
    }
}