Note: this page reproduces a popular Stack Overflow question and its answers, which are provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors. Original: http://stackoverflow.com/questions/10195343/

Copy a file in a sane, safe and efficient way

c++ file-io

Asked by Peter

I'm searching for a good way to copy a file (binary or text). I've written several samples, and every one of them works. But I want to hear the opinion of seasoned programmers.

I'm missing good examples and am looking for a way that works with C++.

ANSI-C-WAY

#include <iostream>
#include <cstdio>    // fopen, fclose, fread, fwrite, BUFSIZ
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    // BUFSIZ is 8192 bytes here
    // a buffer size of 1 means one character at a time
    // good values should fit the block size, like 1024 or 4096
    // higher values reduce the number of system calls
    // size_t BUFFER_SIZE = 4096;

    char buf[BUFSIZ];
    size_t size;

    FILE* source = fopen("from.ogv", "rb");
    FILE* dest = fopen("to.ogv", "wb");
    if (!source || !dest) {
        perror("fopen");
        return 1;
    }

    // clean and more secure:
    // fread() returns 0 at end of file or on error, so no feof() call is needed

    while ((size = fread(buf, 1, BUFSIZ, source)) > 0) {
        fwrite(buf, 1, size, dest);
    }

    fclose(source);
    fclose(dest);

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " << end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

POSIX-WAY (K&R use this in "The C Programming Language", more low-level)

#include <iostream>
#include <fcntl.h>   // open
#include <unistd.h>  // read, write, close
#include <cstdio>    // BUFSIZ
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

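    // BUFSIZ is 8192 bytes here
    // a buffer size of 1 means one character at a time
    // good values should fit the block size, like 1024 or 4096
    // higher values reduce the number of system calls
    // size_t BUFFER_SIZE = 4096;

    char buf[BUFSIZ];
    ssize_t size;  // read() returns ssize_t; -1 signals an error

    int source = open("from.ogv", O_RDONLY, 0);
    int dest = open("to.ogv", O_WRONLY | O_CREAT /*| O_TRUNC/**/, 0644);
    if (source < 0 || dest < 0) {
        perror("open");
        return 1;
    }

    while ((size = read(source, buf, BUFSIZ)) > 0) {
        // write() may write fewer bytes than requested; treat that as an error
        if (write(dest, buf, size) != size) {
            perror("write");
            return 1;
        }
    }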

    close(source);
    close(dest);

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " << end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

KISS-C++-Streambuffer-WAY

#include <iostream>
#include <fstream>
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    ifstream source("from.ogv", ios::binary);
    ofstream dest("to.ogv", ios::binary);

    dest << source.rdbuf();

    source.close();
    dest.close();

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " <<  end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

COPY-ALGORITHM-C++-WAY

#include <iostream>
#include <fstream>
#include <ctime>
#include <algorithm>
#include <iterator>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    ifstream source("from.ogv", ios::binary);
    ofstream dest("to.ogv", ios::binary);

    istreambuf_iterator<char> begin_source(source);
    istreambuf_iterator<char> end_source;
    ostreambuf_iterator<char> begin_dest(dest); 
    copy(begin_source, end_source, begin_dest);

    source.close();
    dest.close();

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " <<  end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

OWN-BUFFER-C++-WAY

#include <iostream>
#include <fstream>
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    ifstream source("from.ogv", ios::binary);
    ofstream dest("to.ogv", ios::binary);

    // file size
    source.seekg(0, ios::end);
    ifstream::pos_type size = source.tellg();
    source.seekg(0);
    // allocate memory for buffer
    char* buffer = new char[size];

    // copy file    
    source.read(buffer, size);
    dest.write(buffer, size);

    // clean up
    delete[] buffer;
    source.close();
    dest.close();

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " <<  end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

LINUX-WAY // requires kernel >= 2.6.33

#include <iostream>
#include <sys/sendfile.h>  // sendfile
#include <fcntl.h>         // open
#include <unistd.h>        // close
#include <sys/stat.h>      // fstat
#include <sys/types.h>     // fstat
#include <ctime>
using namespace std;

int main() {
    clock_t start, end;
    start = clock();

    int source = open("from.ogv", O_RDONLY, 0);
    int dest = open("to.ogv", O_WRONLY | O_CREAT /*| O_TRUNC/**/, 0644);

    // struct required, rationale: function stat() exists also
    struct stat stat_source;
    fstat(source, &stat_source);

    sendfile(dest, source, 0, stat_source.st_size);

    close(source);
    close(dest);

    end = clock();

    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " <<  end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";

    return 0;
}

Environment

  • GNU/LINUX (Archlinux)
  • Kernel 3.3
  • GLIBC-2.15, LIBSTDC++ 4.7 (GCC-LIBS), GCC 4.7, Coreutils 8.16
  • Using RUNLEVEL 3 (Multiuser, Network, Terminal, no GUI)
  • INTEL SSD-Postville 80 GB, filled up to 50%
  • Copy a 270 MB OGG-VIDEO-FILE

Steps to reproduce

 1. $ rm from.ogg
 2. $ reboot                           # kernel and filesystem buffers are back in their regular state
 3. $ (time ./program) &>> report.txt  # runs the program and appends its output to the file
 4. $ sha256sum *.ogv                  # checksum
 5. $ rm to.ogv                        # remove the copy, but no sync; kernel and filesystem buffers are used
 6. $ (time ./program) &>> report.txt  # runs the program and appends its output to the file

Results (CPU time used, in clock() ticks)

Program   Description                   UNBUFFERED | BUFFERED
ANSI C    (fread/fwrite)                   490,000 | 260,000
POSIX     (K&R, read/write)                450,000 | 230,000
FSTREAM   (KISS, Streambuffer)             500,000 | 270,000
FSTREAM   (Algorithm, copy)                500,000 | 270,000
FSTREAM   (OWN-BUFFER)                     500,000 | 340,000
SENDFILE  (native LINUX, sendfile)         410,000 | 200,000

The file size doesn't change.
sha256sum prints the same results.
The video file is still playable.

Questions

  • What method would you prefer?
  • Do you know better solutions?
  • Do you see any mistakes in my code?
  • Do you know a reason to avoid a solution?

  • FSTREAM (KISS, Streambuffer)
    I really like this one, because it is really short and simple. As far as I know, the operator << is overloaded for rdbuf() and doesn't convert anything. Correct?

Thanks

Update 1
I changed the source of all samples so that opening and closing the file descriptors is included in the clock() measurement. There are no other significant changes in the source code. The results didn't change! I also used time to double-check my results.

Update 2
ANSI C sample changed: the condition of the while loop no longer calls feof(); instead I moved fread() into the condition. It looks like the code now runs 10,000 clocks faster.

Measurement changed: the former results were always buffered, because I repeated the old command line rm to.ogv && sync && time ./program a few times for each program. Now I reboot the system for every program. The unbuffered results are new and show no surprises. The unbuffered results didn't really change.

If I don't delete the old copy, the programs behave differently. Overwriting an existing file when buffered is faster with POSIX and SENDFILE; all other programs are slower. Maybe the truncate or create options have an impact on this behaviour. But overwriting existing files with the same copy is not a real-world use case.

Performing the copy with cp takes 0.44 seconds unbuffered and 0.30 seconds buffered. So cp is a little bit slower than the POSIX sample. Looks fine to me.

Maybe I will also add samples and results for mmap() and copy_file() from boost::filesystem.

Update 3
I've also put this on a blog page and extended it a little bit, including splice(), which is a low-level function from the Linux kernel. Maybe more samples with Java will follow. http://www.ttyhoney.com/blog/?page_id=69

Accepted answer by Martin York

Copy a file in a sane way:

#include <fstream>

int main()
{
    std::ifstream  src("from.ogv", std::ios::binary);
    std::ofstream  dst("to.ogv",   std::ios::binary);

    dst << src.rdbuf();
}

This is so simple and intuitive to read that it is worth the extra cost. If we were doing it a lot, it would be better to fall back on OS calls to the file system. I am sure Boost has a copy-file method in its filesystem library.
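
The stream-buffer copy reports nothing on its own, so here is a minimal sketch of how the result could be checked. The helper name copy_stream is hypothetical and not part of the answer. It relies on two facts: operator<< sets failbit on the destination when it inserts no characters (which also happens for an empty source), and a fully copied source has nothing left to read.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not from the answer): copy `from` to `to` through
// the stream buffer and report whether the whole file was written.
bool copy_stream(const std::string& from, const std::string& to)
{
    using traits = std::char_traits<char>;

    std::ifstream src(from, std::ios::binary);
    std::ofstream dst(to, std::ios::binary | std::ios::trunc);
    if (!src || !dst) return false;          // one of the files failed to open

    // operator<< sets failbit when it inserts no characters, so an empty
    // source is handled up front: the (empty) target already exists.
    if (src.peek() == traits::eof())
        return true;

    dst << src.rdbuf();
    dst.close();                             // flushes; a failed close sets failbit

    // If the copy stopped early, the source still has unread data.
    const bool source_drained = src.peek() == traits::eof();
    return source_drained && !dst.fail();
}
```

Calling copy_stream("from.ogv", "to.ogv") would then return false if either file cannot be opened or the write does not complete.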

There is a C function for interacting with the file system (copyfile(3), which is available on macOS):

#include <copyfile.h>

int
copyfile(const char *from, const char *to, copyfile_state_t state, copyfile_flags_t flags);

Answer by manlio

With C++17, the standard way to copy a file is to include the <filesystem> header and use:

bool copy_file( const std::filesystem::path& from,
                const std::filesystem::path& to);

bool copy_file( const std::filesystem::path& from,
                const std::filesystem::path& to,
                std::filesystem::copy_options options);

The first form is equivalent to the second one with copy_options::none used as options (see also copy_file).

The filesystem library was originally developed as boost.filesystem and was finally merged into ISO C++ as of C++17.
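
As a rough illustration of the second form, the sketch below wraps copy_file with copy_options::overwrite_existing and the std::error_code overload, so failures come back as a return value instead of an exception. The names copy_overwrite and demo_copy and the demo file names are made up for this example. With copy_options::none, copying onto an existing target is reported as an error.

```cpp
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical wrapper: copy `from` to `to`, replacing an existing target.
// Returns false instead of throwing when the copy fails.
bool copy_overwrite(const fs::path& from, const fs::path& to)
{
    std::error_code ec;
    const bool copied =
        fs::copy_file(from, to, fs::copy_options::overwrite_existing, ec);
    return copied && !ec;
}

// Usage sketch: write a 7-byte file, copy it twice (the second copy
// overwrites the first), and check the size of the result.
bool demo_copy()
{
    std::ofstream out("demo_from.txt", std::ios::binary);
    out << "payload";
    out.close();

    const bool first  = copy_overwrite("demo_from.txt", "demo_to.txt");
    const bool second = copy_overwrite("demo_from.txt", "demo_to.txt");
    return first && second && fs::file_size("demo_to.txt") == 7;
}
```

This needs a compiler in C++17 mode or later; older toolchains may also require linking an extra filesystem library.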

Answer by Potatoswatter

Too many!

The "ANSI C" way's buffer is redundant, since a FILE is already buffered. (The size of this internal buffer is what BUFSIZ actually defines.)

The "OWN-BUFFER-C++-WAY" will be slow as it goes through fstream, which does a lot of virtual dispatching and again maintains internal buffers for each stream object. (The "COPY-ALGORITHM-C++-WAY" does not suffer from this, as the streambuf_iterator class bypasses the stream layer.)

I prefer the "COPY-ALGORITHM-C++-WAY", but without constructing an fstream; just create bare std::filebuf instances when no actual formatting is needed.
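
A sketch of what that could look like, assuming the same copy algorithm as in the question but with std::filebuf objects in place of the fstreams. The function name copy_filebuf is hypothetical.

```cpp
#include <algorithm>
#include <fstream>
#include <ios>
#include <iterator>

// Hypothetical helper: copy a file using bare std::filebuf objects and
// stream-buffer iterators, with no formatting stream layer on top.
bool copy_filebuf(const char* from, const char* to)
{
    std::filebuf src, dst;
    if (!src.open(from, std::ios::in | std::ios::binary)) return false;
    if (!dst.open(to, std::ios::out | std::ios::binary | std::ios::trunc)) return false;

    std::istreambuf_iterator<char> first(&src), last;
    std::ostreambuf_iterator<char> out(&dst);
    out = std::copy(first, last, out);

    // failed() reports a write error seen by the output iterator;
    // close() flushes whatever is still buffered.
    const bool wrote_all = !out.failed();
    return dst.close() != nullptr && wrote_all;
}
```

Whether this is measurably faster than the fstream version depends on the standard library implementation, so it is worth timing on the target system.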

For raw performance, you can't beat POSIX file descriptors. It's ugly but portable and fast on any platform.

The Linux way appears to be incredibly fast — perhaps the OS let the function return before I/O was finished? In any case, that's not portable enough for many applications.

EDIT: Ah, "native Linux" may be improving performance by interleaving reads and writes with asynchronous I/O. Letting commands pile up can help the disk driver decide when is best to seek. You might try Boost Asio or pthreads for comparison. As for "can't beat POSIX file descriptors"… well that's true if you're doing anything with the data, not just blindly copying.

Answer by rveale

I want to make the very important note that the LINUX method using sendfile() has a major problem: it cannot copy files larger than 2 GB! I implemented it following this question and ran into problems because I was using it to copy HDF5 files that were many GB in size.

http://man7.org/linux/man-pages/man2/sendfile.2.html

sendfile() will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)
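
Because sendfile() returns the number of bytes it actually transferred, one way around the limit is to call it in a loop until the whole file has been sent. The following is a minimal Linux-only sketch under that assumption; the function name copy_with_sendfile is hypothetical, and unlike the sample in the question it opens the target with O_TRUNC.

```cpp
#include <fcntl.h>         // open
#include <sys/sendfile.h>  // sendfile
#include <sys/stat.h>      // fstat
#include <sys/types.h>
#include <unistd.h>        // close

// Hypothetical helper: copy `from` to `to` with repeated sendfile() calls,
// so files above the per-call limit are copied completely.
bool copy_with_sendfile(const char* from, const char* to)
{
    int source = open(from, O_RDONLY);
    if (source < 0) return false;

    int dest = open(to, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (dest < 0) {
        close(source);
        return false;
    }

    struct stat st;
    bool ok = fstat(source, &st) == 0;
    off_t remaining = ok ? st.st_size : 0;

    while (ok && remaining > 0) {
        // A null offset makes sendfile() read from, and advance, the
        // current file position of `source`.
        ssize_t sent = sendfile(dest, source, nullptr,
                                static_cast<size_t>(remaining));
        if (sent <= 0) {
            ok = false;  // error, or the source shrank while copying
        } else {
            remaining -= sent;
        }
    }

    close(source);
    if (close(dest) != 0) ok = false;
    return ok;
}
```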

Answer by Donald Duck

Qt has a method for copying files:

#include <QFile>
QFile::copy("originalFile.example","copiedFile.example");

Note that to use this you have to install Qt and include it in your project.

Answer by anhoppe

For those who like Boost:

boost::filesystem::path mySourcePath("foo.bar");
boost::filesystem::path myTargetPath("bar.foo");

// Variant 1: Overwrite existing
boost::filesystem::copy_file(mySourcePath, myTargetPath, boost::filesystem::copy_option::overwrite_if_exists);

// Variant 2: Fail if exists
boost::filesystem::copy_file(mySourcePath, myTargetPath, boost::filesystem::copy_option::fail_if_exists);

Note that boost::filesystem::path is also available as wpath for Unicode. You could also use

using namespace boost::filesystem;

if you do not like those long type names.

Answer by kuroi neko

I'm not quite sure what a "good way" of copying a file is, but assuming "good" means "fast", I could broaden the subject a little.

Current operating systems have long been optimized to deal with run-of-the-mill file copies. No clever bit of code will beat that. It is possible that some variant of your copy techniques will prove faster in some test scenario, but it would most likely fare worse in other cases.

Typically, the sendfile function probably returns before the write has been committed, thus giving the impression of being faster than the rest. I haven't read the code, but it is most certainly because it allocates its own dedicated buffer, trading memory for time. That would also be the reason why it won't work for files bigger than 2 GB.

As long as you're dealing with a small number of files, everything happens inside various buffers (the C++ runtime's first if you use iostream, then the OS's internal ones, and apparently a file-sized extra buffer in the case of sendfile). The actual storage medium is only accessed once enough data has been moved around to be worth the trouble of spinning a hard disk.

I suppose you could slightly improve performances in specific cases. Off the top of my head:

  • If you're copying a huge file on the same disk, using a buffer bigger than the OS's might improve things a bit (but we're probably talking about gigabytes here).
  • If you want to copy the same file to two different physical destinations, you will probably be faster opening the three files at once than calling copy_file twice sequentially (though you'll hardly notice the difference as long as the file fits in the OS cache).
  • If you're dealing with lots of tiny files on an HDD, you might want to read them in batches to minimize seek time (though the OS already caches directory entries to avoid seeking like crazy, and tiny files will likely reduce disk bandwidth dramatically anyway).
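
The two-destination case from the list above can be sketched as follows: the source is read once in chunks and each chunk is written to both targets. The function name copy_to_two and the 64 KiB chunk size are arbitrary choices for this illustration, not something the answer prescribes.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read `from` once and write every chunk to both targets.
bool copy_to_two(const std::string& from,
                 const std::string& to_a,
                 const std::string& to_b)
{
    std::ifstream src(from, std::ios::binary);
    std::ofstream a(to_a, std::ios::binary | std::ios::trunc);
    std::ofstream b(to_b, std::ios::binary | std::ios::trunc);
    if (!src || !a || !b) return false;

    std::vector<char> buf(64 * 1024);  // chunk size is an arbitrary choice
    while (src) {
        src.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize n = src.gcount();  // the last chunk may be short
        if (n > 0) {
            a.write(buf.data(), n);
            b.write(buf.data(), n);
        }
    }

    a.close();
    b.close();
    // The loop must have ended at end of file, and both writes must be clean.
    return src.eof() && !a.fail() && !b.fail();
}
```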

But all that is outside the scope of a general purpose file copy function.

So in my arguably seasoned programmer's opinion, a C++ file copy should just use the dedicated C++17 copy_file function, unless more is known about the context in which the file copy occurs and some clever strategy can be devised to outsmart the OS.