
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4358875/

Date: 2020-08-14 16:27:42  Source: igfitidea

Fastest way to write an array of integers to a file in Java?

Tags: java, performance, file-io

Asked by Ollie Glass

As the title says, I'm looking for the fastest possible way to write integer arrays to files. The arrays will vary in size, and will realistically contain anywhere between 2,500 and 25,000,000 ints.

Here's the code I'm presently using:

DataOutputStream writer = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(filename)));

for (int d : data)
  writer.writeInt(d);

Given that DataOutputStream has a method for writing arrays of bytes, I've tried converting the int array to a byte array like this:

private static byte[] integersToBytes(int[] values) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    for (int i = 0; i < values.length; ++i) {
        dos.writeInt(values[i]);
    }

    return baos.toByteArray();
}

and like this:

private static byte[] integersToBytes2(int[] src) {
    int srcLength = src.length;
    byte[] dst = new byte[srcLength << 2];

    for (int i = 0; i < srcLength; i++) {
        int x = src[i];
        int j = i << 2;
        dst[j++] = (byte) ((x >>> 0) & 0xff);
        dst[j++] = (byte) ((x >>> 8) & 0xff);
        dst[j++] = (byte) ((x >>> 16) & 0xff);
        dst[j++] = (byte) ((x >>> 24) & 0xff);
    }
    return dst;
}

Both seem to give a minor speed increase, about 5%. I've not tested them rigorously enough to confirm that.

Are there any techniques that will speed up this file write operation, or relevant guides to best practice for Java IO write performance?

Accepted answer by cletus

I had a look at three options:

  1. Using DataOutputStream;
  2. Using ObjectOutputStream (for Serializable objects, which int[] is); and
  3. Using FileChannel.

The results are:

DataOutputStream wrote 1,000,000 ints in 3,159.716 ms
ObjectOutputStream wrote 1,000,000 ints in 295.602 ms
FileChannel wrote 1,000,000 ints in 110.094 ms

So the NIO version is the fastest. It also has the advantage of allowing edits, meaning you can easily change one int, whereas the ObjectOutputStream approach would require reading the entire array, modifying it, and writing it out to file again.

Code follows:

private static final int NUM_INTS = 1000000;

interface IntWriter {
  void write(int[] ints);
}

public static void main(String[] args) {
  int[] ints = new int[NUM_INTS];
  Random r = new Random();
  for (int i=0; i<NUM_INTS; i++) {
    ints[i] = r.nextInt();
  }
  time("DataOutputStream", new IntWriter() {
    public void write(int[] ints) {
      storeDO(ints);
    }
  }, ints);
  time("ObjectOutputStream", new IntWriter() {
    public void write(int[] ints) {
      storeOO(ints);
    }
  }, ints);
  time("FileChannel", new IntWriter() {
    public void write(int[] ints) {
      storeFC(ints);
    }
  }, ints);
}

private static void time(String name, IntWriter writer, int[] ints) {
  long start = System.nanoTime();
  writer.write(ints);
  long end = System.nanoTime();
  double ms = (end - start) / 1000000d;
  System.out.printf("%s wrote %,d ints in %,.3f ms%n", name, ints.length, ms);
}

private static void storeOO(int[] ints) {
  ObjectOutputStream out = null;
  try {
    out = new ObjectOutputStream(new FileOutputStream("object.out"));
    out.writeObject(ints);
  } catch (IOException e) {
    throw new RuntimeException(e);
  } finally {
    safeClose(out);
  }
}

private static void storeDO(int[] ints) {
  DataOutputStream out = null;
  try {
    out = new DataOutputStream(new FileOutputStream("data.out"));
    for (int anInt : ints) {
      out.writeInt(anInt); // writeInt, not write: write(int) stores only the low-order byte
    }
  } catch (IOException e) {
    throw new RuntimeException(e);
  } finally {
    safeClose(out);
  }
}

private static void storeFC(int[] ints) {
  RandomAccessFile raf = null;
  try {
    // mapping with MapMode.READ_WRITE requires a channel open for both
    // reading and writing, so use RandomAccessFile rather than FileOutputStream
    raf = new RandomAccessFile("fc.out", "rw");
    FileChannel file = raf.getChannel();
    MappedByteBuffer buf = file.map(FileChannel.MapMode.READ_WRITE, 0, 4 * ints.length);
    for (int i : ints) {
      buf.putInt(i);
    }
    file.close();
  } catch (IOException e) {
    throw new RuntimeException(e);
  } finally {
    if (raf != null) {
      try {
        raf.close();
      } catch (IOException e) {
        // do nothing
      }
    }
  }
}

private static void safeClose(OutputStream out) {
  try {
    if (out != null) {
      out.close();
    }
  } catch (IOException e) {
    // do nothing
  }
}

Answer by Steve Townsend

An int[] is Serializable - can't you just use writer.writeObject(data);? That's definitely going to be faster than individual writeInt calls.

If you have requirements on the output data format other than being able to read it back into an int[], that's a different question.

Answer by Lachezar Balev

I think you should consider using file channels (the java.nio library) instead of plain streams (java.io). A good starting point is this interesting discussion: Java NIO FileChannel versus FileOutputstream performance / usefulness

and the relevant comments below.

Cheers!

Answer by Peter Lawrey

The main improvements you can make when writing an int[] are to either:

  • increase the buffer size. The default size is right for most streams, but file access can be faster with a larger buffer. This could yield a 10-20% improvement.

  • Use NIO and a direct buffer. This allows you to write 32-bit values without converting to bytes. This may yield a 5% improvement.

BTW: You should be able to write at least 10 million int values per second. With disk caching you increase this to 200 million per second.

Answer by dacwe

I would use FileChannel from the nio package and a ByteBuffer. This approach seems (on my computer) to give 2 to 4 times better write performance:

Output from program:

normal time: 2555
faster time: 765

This is the program:

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.Random;

public class Test {

    public static void main(String[] args) throws IOException {

        // create a test buffer
        ByteBuffer buffer = createBuffer();

        long start = System.currentTimeMillis();
        {
            // do the first test (the normal way of writing files)
            normalToFile(new File("first"), buffer.asIntBuffer());
        }
        long middle = System.currentTimeMillis(); 
        {
            // use the faster nio stuff
            fasterToFile(new File("second"), buffer);
        }
        long done = System.currentTimeMillis();

        // print the result
        System.out.println("normal time: " + (middle - start));
        System.out.println("faster time: " + (done - middle));
    }

    private static void fasterToFile(File file, ByteBuffer buffer) 
    throws IOException {

        FileChannel fc = null;

        try {

            fc = new FileOutputStream(file).getChannel();
            fc.write(buffer);

        } finally {

            if (fc != null)
                fc.close();

            buffer.rewind();
        }
    }

    private static void normalToFile(File file, IntBuffer buffer) 
    throws IOException {

        DataOutputStream writer = null;

        try {
            writer = 
                new DataOutputStream(new BufferedOutputStream(
                        new FileOutputStream(file)));

            while (buffer.hasRemaining())
                writer.writeInt(buffer.get());

        } finally {
            if (writer != null)
                writer.close();

            buffer.rewind();
        }
    }

    private static ByteBuffer createBuffer() {
        ByteBuffer buffer = ByteBuffer.allocate(4 * 25000000);
        Random r = new Random(1);

        while (buffer.hasRemaining()) 
            buffer.putInt(r.nextInt());

        buffer.rewind();

        return buffer;
    }
}

Answer by Björn Lindqvist

Benchmarks should be repeated every once in a while, shouldn't they? :) After fixing some bugs and adding my own writing variant, here are the results I get when running the benchmark on an ASUS ZenBook UX305 running Windows 10 (times given in seconds):

Running tests... 0 1 2
Buffered DataOutputStream           8,14      8,46      8,30
FileChannel alt2                    1,55      1,18      1,12
ObjectOutputStream                  9,60     10,41     11,68
FileChannel                         1,49      1,20      1,21
FileChannel alt                     5,49      4,58      4,66

And here are the results running on the same computer but with Arch Linux and the order of the write methods switched:

Running tests... 0 1 2
Buffered DataOutputStream          31,16      6,29      7,26
FileChannel                         1,07      0,83      0,82
FileChannel alt2                    1,25      1,71      1,42
ObjectOutputStream                  3,47      5,39      4,40
FileChannel alt                     2,70      3,27      3,46

Each test wrote an 800 MB file. The unbuffered DataOutputStream took way too long, so I excluded it from the benchmark.

As seen, writing using a file channel still beats the crap out of all other methods, but it matters a lot whether the byte buffer is memory-mapped or not. Without memory-mapping the file channel write took 3-5 seconds:

var bb = ByteBuffer.allocate(4 * ints.length);
for (int i : ints)
    bb.putInt(i);
bb.flip();
try (var fc = new FileOutputStream("fcalt.out").getChannel()) {
    fc.write(bb);
}

With memory-mapping, the time was reduced to between 0.8 to 1.5 seconds:

try (var fc = new RandomAccessFile("fcalt2.out", "rw").getChannel()) {
    var bb = fc.map(READ_WRITE, 0, 4 * ints.length);
    bb.asIntBuffer().put(ints);
}

But note that the results are order-dependent, especially so on Linux. It appears that the memory-mapped method doesn't write the data in full but rather offloads the job to the OS and returns before it is completed. Whether that behaviour is desirable or not depends on the situation.

Memory-mapping can also lead to OutOfMemoryError problems, so it is not always the right tool to use. See: Prevent OutOfMemory when using java.nio.MappedByteBuffer.

Here is my version of the benchmark code: https://gist.github.com/bjourne/53b7eabc6edea27ffb042e7816b7830b
