Notice: this page reproduces a popular StackOverflow question and its answers (originally published as a Chinese-English parallel translation) under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/7463689/

Time: 2020-08-05 06:14:59  Source: igfitidea

Most efficient way to copy a file in Linux

c linux

Asked by Radu

I am working on an OS-independent file manager, and I am looking for the most efficient way to copy a file on Linux. Windows has a built-in function, CopyFileEx(), but from what I have seen, there is no such standard function on Linux, so I guess I will have to implement my own. The obvious way is fopen/fread/fwrite, but is there a better (faster) way of doing it? I must also be able to pause every once in a while so that I can update the "copied so far" count for the file progress menu.

Accepted answer by Nemo

Unfortunately, on Linux kernels before 2.6.33 you cannot use sendfile() here because the destination is not a socket; from 2.6.33 on, the destination can be any file. (The name sendfile() comes from send() + "file".)

For zero-copy, you can use splice() as suggested by @Dave. (Except it will not be zero-copy; it will be "one copy" from the source file's page cache to the destination file's page cache.)

However... (a) splice() is Linux-specific; and (b) you can almost certainly do just as well using portable interfaces, provided you use them correctly.

In short, use open() + read() + write() with a small temporary buffer. I suggest 8K. So your code would look something like this:

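#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Fragment from inside a function; "source" and "dest" are placeholder paths. */
int in_fd = open("source", O_RDONLY);
assert(in_fd >= 0);
/* O_CREAT | O_TRUNC so the destination is created if missing; mode 0644 is an assumption. */
int out_fd = open("dest", O_WRONLY | O_CREAT | O_TRUNC, 0644);
assert(out_fd >= 0);
char buf[8192];

while (1) {
    ssize_t read_result = read(in_fd, &buf[0], sizeof(buf));
    if (!read_result) break;                /* 0 means end of file */
    assert(read_result > 0);
    ssize_t write_result = write(out_fd, &buf[0], read_result);
    assert(write_result == read_result);    /* a short write is treated as fatal here */
}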

With this loop, you will be copying 8K from the in_fd page cache into the CPU L1 cache, then writing it from the L1 cache into the out_fd page cache. Then you will overwrite that part of the L1 cache with the next 8K chunk from the file, and so on. The net result is that the data in buf will never actually be stored in main memory at all (except maybe once at the end); from the system RAM's point of view, this is just as good as using "zero-copy" splice(). Plus it is perfectly portable to any POSIX system.
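
The loop above can be hardened a little for real use. What follows is only a sketch under stated assumptions: the helper name copy_fd() and the callback type progress_fn are hypothetical (they are not part of any library). It retries on EINTR, handles short writes, and reports the running byte count after every chunk, which is one way to feed the "copied so far" display mentioned in the question.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type: receives the running byte count. */
typedef void (*progress_fn)(off_t copied_so_far, void *user_data);

/* Copies everything from in_fd to out_fd through a small buffer.
 * Returns the number of bytes copied, or -1 with errno set. */
static off_t copy_fd(int in_fd, int out_fd, progress_fn report, void *user_data)
{
    char buf[8192];
    off_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;               /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }
        /* write() may accept fewer bytes than requested, so loop. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
        if (report)
            report(total, user_data);   /* update the progress display */
    }
}
```

Passing NULL for report skips progress reporting. Because the callback runs after every 8K chunk, a GUI would probably want to throttle how often it actually redraws.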

Note that the small buffer is key here. Typical modern CPUs have 32K or so for the L1 data cache, so if you make the buffer too big, this approach will be slower. Possibly much, much slower. So keep the buffer in the "few kilobytes" range.

Of course, unless your disk subsystem is very, very fast, memory bandwidth is probably not your limiting factor. So I would recommend posix_fadvise to let the kernel know what you are up to:

posix_fadvise(in_fd, 0, 0, POSIX_FADV_SEQUENTIAL);

This will give a hint to the Linux kernel that its read-ahead machinery should be very aggressive.
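
A small sketch of how that call might be wrapped, assuming a hypothetical helper named advise_sequential(). One detail worth knowing: posix_fadvise returns the error number directly instead of setting errno. Since the call is only a hint, a failure is reported here but not treated as fatal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Hints that in_fd will be read sequentially from start to end.
 * Returns 0 on success or an error number; errno is not set. */
static int advise_sequential(int in_fd)
{
    assert(in_fd >= 0);
    /* Offset 0 with length 0 means "from the start to the end of the file". */
    int err = posix_fadvise(in_fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (err != 0)
        fprintf(stderr, "posix_fadvise: %s (continuing without the hint)\n",
                strerror(err));
    return err;
}
```

Call it once, right after opening the source file and before the first read().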

I would also suggest using posix_fallocate to preallocate the storage for the destination file. This will tell you ahead of time whether you will run out of disk space. And for a modern kernel with a modern file system (like XFS), it will help to reduce fragmentation in the destination file.
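
As a rough illustration, the preallocation could look like the sketch below. The helper name preallocate_for_copy() is hypothetical, and the sketch assumes the source is a regular file whose size fstat() reports. Like posix_fadvise, posix_fallocate returns an error number (for example ENOSPC) instead of setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Reserves space in out_fd for a full copy of in_fd.
 * Returns 0 on success or an error number such as ENOSPC. */
static int preallocate_for_copy(int in_fd, int out_fd)
{
    struct stat st;

    assert(in_fd >= 0 && out_fd >= 0);
    if (fstat(in_fd, &st) != 0)
        return errno;
    if (st.st_size == 0)
        return 0;   /* posix_fallocate rejects a zero length with EINVAL */
    return posix_fallocate(out_fd, 0, st.st_size);
}
```

If this returns ENOSPC, the copy can be refused before a single byte is written.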

The last thing I would recommend is mmap. It is usually the slowest approach of all thanks to TLB thrashing. (Very recent kernels with "transparent hugepages" might mitigate this; I have not tried recently. But it certainly used to be very bad. So I would only bother testing mmap if you have lots of time to benchmark and a very recent kernel.)

[Update]

There is some question in the comments about whether splice from one file to another is zero-copy. The Linux kernel developers call this "page stealing". Both the man page for splice and the comments in the kernel source say that the SPLICE_F_MOVE flag should provide this functionality.

Unfortunately, the support for SPLICE_F_MOVE was yanked in 2.6.21 (back in 2007) and never replaced. (The comments in the kernel sources never got updated.) If you search the kernel sources, you will find SPLICE_F_MOVE is not actually referenced anywhere. The last message I can find (from 2008) says it is "waiting for a replacement".

The bottom line is that splice from one file to another calls memcpy to move the data; it is not zero-copy. This is not much better than you can do in userspace using read/write with small buffers, so you might as well stick to the standard, portable interfaces.

If "page stealing" is ever added back into the Linux kernel, then the benefits of splice would be much greater. (And even today, when the destination is a socket, you get true zero-copy, making splice more attractive.) But for the purpose of this question, splice does not buy you very much.

Answer by Michael Ekstrand

Use open/read/write; they avoid the libc-level buffering done by fopen and friends.

Alternatively, if you are using GLib, you could use its g_file_copy function (part of GIO).

Finally, something that may be faster, although you should benchmark it to be sure: use open and mmap to memory-map the input file, then write from the memory region to the output file. You'll probably want to keep open/read/write around as a fallback, as this method is limited to the address space size of your process.
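
A minimal sketch of that idea, assuming a hypothetical helper named copy_via_mmap() that maps the whole input at once (so it inherits the address-space limit mentioned above). When mmap fails, it returns -1 and the caller can fall back to the plain read/write loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Maps all of in_fd read-only and writes the mapping to out_fd.
 * Returns 0 on success, -1 with errno set on failure. */
static int copy_via_mmap(int in_fd, int out_fd)
{
    struct stat st;

    assert(in_fd >= 0 && out_fd >= 0);
    if (fstat(in_fd, &st) != 0)
        return -1;
    if (st.st_size == 0)
        return 0;                       /* mmap rejects a zero length */

    size_t len = (size_t)st.st_size;
    char *src = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in_fd, 0);
    if (src == MAP_FAILED)
        return -1;                      /* caller can fall back to read/write */

    int rc = 0;
    for (size_t off = 0; off < len; ) {
        ssize_t w = write(out_fd, src + off, len - off);
        if (w < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            rc = -1;
            break;
        }
        off += (size_t)w;               /* short write: continue from here */
    }

    int saved = errno;                  /* keep the write() error, if any */
    munmap(src, len);
    errno = saved;
    return rc;
}
```

For progress reporting, the write could be issued in smaller slices of the mapping instead of one large call.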

Edit: the original answer suggested mapping both files; @bdonlan made the excellent suggestion in a comment to map only one.

Answer by ennuikiller

You may want to benchmark the dd command

Answer by Dave

If you know they will be running Linux 2.6.17 or later, splice() is the way to do zero-copy on Linux:

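 // Needs #define _GNU_SOURCE before #include <fcntl.h> and <unistd.h>.
 // Using some default parameters for clarity below. Don't do this in production.
 #define splice(a, b, c) splice(a, NULL, b, NULL, c, 0)
 int p[2];
 pipe(p);
 int out = open(OUTFILE, O_WRONLY);
 int in = open(INFILE, O_RDONLY);
 // Inner call: file -> pipe; outer call: pipe -> file. Stops at end of file; no error handling.
 while (splice(p[0], out, splice(in, p[1], 4096)) > 0);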
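
For comparison, here is a more defensive sketch of the same file -> pipe -> file technique without the macro. The helper name copy_via_splice() and the 64K chunk size are assumptions chosen for illustration. It checks every return value and drains the pipe completely before reading more.

```c
#define _GNU_SOURCE                 /* splice() is a Linux extension */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copies in_fd to out_fd through a pipe using splice().
 * Returns the number of bytes copied, or -1 on failure. */
static off_t copy_via_splice(int in_fd, int out_fd)
{
    int p[2];
    off_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(p) != 0)
        return -1;

    for (;;) {
        /* File -> pipe. NULL offsets mean "use and advance the file position". */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, 65536, 0);
        if (n == 0)
            break;                  /* end of input */
        if (n < 0) {
            total = -1;
            break;
        }
        /* Pipe -> file: drain everything that was just queued. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t w = splice(p[0], NULL, out_fd, NULL, (size_t)left, 0);
            if (w <= 0)
                break;
            left -= w;
        }
        if (left > 0) {
            total = -1;
            break;
        }
        total += n;
    }

    int saved = errno;              /* close() must not clobber the error */
    close(p[0]);
    close(p[1]);
    errno = saved;
    return total;
}
```

Note that splice() fails with EINVAL when the destination was opened with O_APPEND, so the output file should be opened without it.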

Answer by Richard Hodges

My answer from a more recent duplicate of this post.

Boost now offers mapped_file_source, which portably models a memory-mapped file.

Maybe not as efficient as CopyFileEx() and splice(), but portable and succinct.

This program takes 2 filename arguments. It copies the first half of the source file to the destination file.

#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>
#include <fstream>
#include <cstdlib>   // std::exit

namespace iostreams = boost::iostreams;
int main(int argc, char** argv)
{
    if (argc != 3)
    {
        std::cerr << "usage: " << argv[0] << " <infile> <outfile> - copies half of the infile to outfile" << std::endl;
        std::exit(100);
    }

    // Map the whole input file read-only; this throws if the file cannot be mapped.
    auto source = iostreams::mapped_file_source(argv[1]);
    auto dest = std::ofstream(argv[2], std::ios::binary);
    // Report write failures as exceptions instead of silent stream errors.
    dest.exceptions(std::ios::failbit | std::ios::badbit);
    auto first = source.begin();
    auto bytes = source.size() / 2;   // only the first half is copied
    dest.write(first, bytes);
}

Depending on OS, your mileage may vary with system calls such as splice and sendfile; however, note the comments in the man page:

Applications may wish to fall back to read(2)/write(2) in the case where sendfile() fails with EINVAL or ENOSYS.
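
The fallback the man page describes might be sketched as follows. The helper name copy_sendfile_or_rw() and the 1 MiB request size are assumptions for illustration, and EINTR handling is omitted for brevity. Because sendfile() is called with a NULL offset, it advances the position of in_fd itself, so the read/write loop can simply continue from wherever sendfile() stopped.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Copies in_fd to out_fd with sendfile(), switching to read()/write()
 * if sendfile() fails with EINVAL or ENOSYS.
 * Returns the number of bytes copied, or -1 with errno set. */
static off_t copy_sendfile_or_rw(int in_fd, int out_fd)
{
    char buf[8192];
    off_t total = 0;
    int use_sendfile = 1;

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t n;
        if (use_sendfile) {
            n = sendfile(out_fd, in_fd, NULL, 1 << 20);
            if (n < 0 && (errno == EINVAL || errno == ENOSYS)) {
                use_sendfile = 0;       /* not supported here: fall back */
                continue;
            }
        } else {
            n = read(in_fd, buf, sizeof buf);
            /* Handle short writes; the loop is skipped when n <= 0. */
            for (ssize_t off = 0; off < n; ) {
                ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
                if (w < 0)
                    return -1;
                off += w;
            }
        }
        if (n < 0)
            return -1;
        if (n == 0)
            return total;               /* end of input */
        total += n;
    }
}
```

Each sendfile() call moves at most 1 MiB here, so a progress counter can be updated between calls.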
