How to write super-fast file-streaming code in C#?

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.

Original URL: http://stackoverflow.com/questions/955911/
Asked by ala
I have to split a huge file into many smaller files. Each of the destination files is defined by an offset and a length, given as a number of bytes. I'm using the following code:
private void copy(string srcFile, string dstFile, int offset, int length)
{
    BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    byte[] buffer = reader.ReadBytes(length);
    BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
    writer.Write(buffer);
}
Considering that I have to call this function about 100,000 times, it is remarkably slow.
- Is there a way to connect the Writer directly to the Reader? (That is, without actually loading the contents into a buffer in memory.)
Accepted answer by Jon Skeet
I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:
public static void CopySection(Stream input, string targetFile, int length)
{
    byte[] buffer = new byte[8192];
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}
This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:
public static void CopySection(Stream input, string targetFile,
                               int length, byte[] buffer)
{
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}
Note that this also closes the output stream (due to the using statement) which your original code didn't.
The important point is that this will use the operating system file buffering more efficiently, because you reuse the same input stream, instead of reopening the file at the beginning and then seeking.
I think it'll be significantly faster, but obviously you'll need to try it to see...
This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
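For illustration, a minimal sketch of what the calling loop might look like with these suggestions applied (srcFile, targetFiles and sectionLengths are assumed names, not code from the answer):

// Illustrative caller: srcFile, targetFiles and sectionLengths are assumed inputs.
// Sections are assumed to be contiguous, so no seeking is needed between calls.
byte[] buffer = new byte[8192];
using (Stream input = new BufferedStream(File.OpenRead(srcFile)))
{
    for (int i = 0; i < targetFiles.Length; i++)
    {
        CopySection(input, targetFiles[i], sectionLengths[i], buffer);
    }
}

The BufferedStream mainly helps when many of the output files are tiny; for larger sections the shared 8 KB buffer already keeps the number of read calls down.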
Answered by Marc Gravell
How large is length? You may do better to re-use a fixed-size (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.
(edit) something like:
private static void copy(string srcFile, string dstFile, int offset,
                         int length, byte[] buffer)
{
    using (Stream inStream = File.OpenRead(srcFile))
    using (Stream outStream = File.OpenWrite(dstFile))
    {
        inStream.Seek(offset, SeekOrigin.Begin);
        int bufferLength = buffer.Length, bytesRead;
        while (length > bufferLength &&
               (bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
        while (length > 0 &&
               (bytesRead = inStream.Read(buffer, 0, length)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}
Answered by schnaader
You shouldn't re-open the source file each time you do a copy; it's better to open it once and pass the resulting BinaryReader to the copy function. Also, it might help if you order your seeks, so you don't make big jumps inside the file.
If the lengths aren't too big, you can also try to group several copy calls by grouping offsets that are near to each other and reading the whole block you need for them, for example:
offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000
can be grouped to one read:
offset = 1234, length = 1116 (the last section ends at 1350 + 1000 = 2350, so the grouped block has to span 2350 - 1234 = 1116 bytes)
Then you only have to "seek" in your buffer and can write the three new files from there without having to read again.
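A rough sketch of that grouping idea, with hypothetical file names and a helper that is not part of the answer (the single Read call is kept simple; a robust version would loop until the whole block has been read):

// Hypothetical example: read the grouped block once, then slice it into files.
private static void CopyGroupedSections(string srcFile)
{
    byte[] block = new byte[1116];                    // covers offsets 1234..2350
    using (FileStream input = File.OpenRead(srcFile))
    {
        input.Seek(1234, SeekOrigin.Begin);
        input.Read(block, 0, block.Length);           // simplified: assumes a full read
    }

    // "Seek" within the in-memory block: offsets are relative to 1234.
    WriteSlice(block, 1234 - 1234, 34,   "part1.bin");
    WriteSlice(block, 1300 - 1234, 40,   "part2.bin");
    WriteSlice(block, 1350 - 1234, 1000, "part3.bin");
}

private static void WriteSlice(byte[] block, int start, int length, string path)
{
    using (FileStream output = File.OpenWrite(path))
    {
        output.Write(block, start, length);
    }
}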
Answered by JMarsch
The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?
Over 100,000 accesses (sum the times): How much time is spent allocating the buffer array? How much time is spent opening the file for read (is it the same file every time?) How much time is spent in read and write operations?
If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a FileStream for writes? (Try it - do you get identical output? Does it save time?)
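For example, a hedged sketch of how those timings could be collected with System.Diagnostics.Stopwatch (the method and variable names are placeholders, not code from the question):

// Hypothetical instrumentation: the stopwatches are shared across all
// ~100,000 calls so the accumulated totals show where the time goes.
// Requires: using System.Diagnostics;
static readonly Stopwatch OpenTime  = new Stopwatch();
static readonly Stopwatch ReadTime  = new Stopwatch();
static readonly Stopwatch WriteTime = new Stopwatch();

static void TimedCopy(FileStream input, string dstFile, int offset, int length, byte[] buffer)
{
    OpenTime.Start();
    FileStream output = File.OpenWrite(dstFile);
    OpenTime.Stop();

    using (output)
    {
        input.Seek(offset, SeekOrigin.Begin);
        while (length > 0)
        {
            ReadTime.Start();
            int bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            ReadTime.Stop();
            if (bytesRead == 0) break;

            WriteTime.Start();
            output.Write(buffer, 0, bytesRead);
            WriteTime.Stop();
            length -= bytesRead;
        }
    }
}

// After the whole run:
// Console.WriteLine("open: {0} ms, read: {1} ms, write: {2} ms",
//     OpenTime.ElapsedMilliseconds, ReadTime.ElapsedMilliseconds, WriteTime.ElapsedMilliseconds);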
Answered by Richard
(For future reference.)
Quite possibly the fastest way to do this would be to use memory mapped files (so primarily copying memory, and the OS handling the file reads/writes via its paging/memory management).
Memory Mapped files are supported in managed code in .NET 4.0.
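For illustration, a minimal sketch of copying one section through the managed API in System.IO.MemoryMappedFiles (the method name and parameters are assumptions, not code from the answer):

// Requires .NET 4.0+ and: using System.IO; using System.IO.MemoryMappedFiles;
static void CopySectionMapped(string srcFile, string dstFile, long offset, long length)
{
    // Map the source file read-only (mapName null, capacity 0 = use the file size).
    using (var map = MemoryMappedFile.CreateFromFile(srcFile, FileMode.Open, null, 0,
                                                     MemoryMappedFileAccess.Read))
    using (var section = map.CreateViewStream(offset, length, MemoryMappedFileAccess.Read))
    using (var output = File.OpenWrite(dstFile))
    {
        // The view stream exposes only the requested slice of the mapped file,
        // so copying it writes exactly 'length' bytes starting at 'offset'.
        section.CopyTo(output);
    }
}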
But as noted, you need to profile, and expect to switch to native code for maximum performance.
Answered by TheSean
Has no one suggested threading? Writing the smaller files looks like a textbook example of where threads are useful. Set up a bunch of threads to create the smaller files. This way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files (a disk operation) will take WAY longer than splitting up the data. And of course, you should verify first that a sequential approach is not adequate.
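A rough sketch of that idea using Parallel.ForEach from the TPL (a choice made here for illustration, not the answer's; the Section type and all names are hypothetical):

// Requires .NET 4.0+ and: using System.Collections.Generic; using System.IO;
//                          using System.Threading.Tasks;
class Section
{
    public string TargetFile;
    public long Offset;
    public int Length;
}

static void SplitInParallel(string srcFile, IList<Section> sections)
{
    Parallel.ForEach(sections, section =>
    {
        // Each iteration opens its own reader, so no synchronization is needed.
        using (FileStream input = File.OpenRead(srcFile))
        using (FileStream output = File.OpenWrite(section.TargetFile))
        {
            input.Seek(section.Offset, SeekOrigin.Begin);
            byte[] buffer = new byte[section.Length];
            int read = input.Read(buffer, 0, buffer.Length);   // simplified single read
            output.Write(buffer, 0, read);
        }
    });
}

Parallel.ForEach schedules the work on the thread pool and limits how many iterations run at once, which matters when there are on the order of 100,000 output files.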
Answered by HasaniH
Have you considered using the CCR (Concurrency and Coordination Runtime)? Since you are writing to separate files, you can do everything in parallel (read and write), and the CCR makes it very easy to do this.
// The CCR types (Dispatcher, DispatcherQueue, Port, Arbiter, Handler) come from
// the Microsoft CCR libraries; file_path and split_size are fields assumed to be
// defined elsewhere in the class.
static void Main(string[] args)
{
    Dispatcher dp = new Dispatcher();
    DispatcherQueue dq = new DispatcherQueue("DQ", dp);

    // Every offset posted to this port is handled by Split on a CCR task.
    Port<long> offsetPort = new Port<long>();
    Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
        new Handler<long>(Split)));

    FileStream fs = File.Open(file_path, FileMode.Open);
    long size = fs.Length;
    fs.Dispose();

    for (long i = 0; i < size; i += split_size)
    {
        offsetPort.Post(i);
    }
}

private static void Split(long offset)
{
    FileStream reader = new FileStream(file_path, FileMode.Open,
        FileAccess.Read);
    reader.Seek(offset, SeekOrigin.Begin);

    long toRead = 0;
    if (offset + split_size <= reader.Length)
        toRead = split_size;
    else
        toRead = reader.Length - offset;

    byte[] buff = new byte[toRead];
    reader.Read(buff, 0, (int)toRead);
    reader.Dispose();

    // Verbatim string so the backslash in the path is not treated as an escape sequence.
    File.WriteAllBytes(@"c:\out" + offset + ".txt", buff);
}
This code posts offsets to a CCR port, which causes a thread to be created to execute the code in the Split method. This causes you to open the file multiple times, but it gets rid of the need for synchronization. You can make it more memory efficient, but you'll have to sacrifice speed.
Answered by mcauthorn
Using FileStream + StreamWriter, I know it's possible to create massive files in little time (less than 1 minute 30 seconds). I generated three files totaling 700+ megabytes from one file using that technique.
Your primary problem with the code you're using is that you are opening a file every time. That is creating file I/O overhead.
If you knew the names of the files you would be generating ahead of time, you could extract the File.OpenWrite into a separate method; it will increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.
Answered by Bob Bryan
The fastest way to do file I/O from C# is to use the Windows ReadFile and WriteFile functions. I have written a C# class that encapsulates this capability, as well as a benchmarking program that looks at different I/O methods, including BinaryReader and BinaryWriter. See my blog post at:
http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/
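For context, those Win32 functions are reached from C# through P/Invoke; a minimal sketch of the usual kernel32 declarations (shown for illustration, not taken from the linked post):

// Requires: using System; using System.Runtime.InteropServices;
//           using Microsoft.Win32.SafeHandles;
[DllImport("kernel32.dll", SetLastError = true)]
static extern bool ReadFile(SafeFileHandle hFile, byte[] lpBuffer,
    uint nNumberOfBytesToRead, out uint lpNumberOfBytesRead, IntPtr lpOverlapped);

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool WriteFile(SafeFileHandle hFile, byte[] lpBuffer,
    uint nNumberOfBytesToWrite, out uint lpNumberOfBytesWritten, IntPtr lpOverlapped);

// A FileStream's handle can be passed straight through, for example:
// uint written;
// WriteFile(output.SafeFileHandle, buffer, (uint)bytesRead, out written, IntPtr.Zero);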