C++: How to read a binary file into a vector of unsigned chars

Question

Asked by LihO

Lately I've been asked to write a function that reads a binary file into a std::vector<BYTE>, where BYTE is an unsigned char. Quite quickly I came up with something like this:

#include <fstream>
#include <vector>
typedef unsigned char BYTE;

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::streampos fileSize;
    std::ifstream file(filename, std::ios::binary);

    // get its size:
    file.seekg(0, std::ios::end);
    fileSize = file.tellg();
    file.seekg(0, std::ios::beg);

    // read the data:
    std::vector<BYTE> fileData(fileSize);
    file.read((char*) &fileData[0], fileSize);
    return fileData;
}

which seems unnecessarily complicated, and the explicit cast to char* that I was forced to use when calling file.read doesn't make me feel any better about it.

Another option is to use std::istreambuf_iterator:

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::ifstream file(filename, std::ios::binary);

    // read the data:
    return std::vector<BYTE>((std::istreambuf_iterator<char>(file)),
                              std::istreambuf_iterator<char>());
}

which is pretty simple and short, but I still have to use std::istreambuf_iterator<char> even though I'm reading into a std::vector<unsigned char>.

The last option, which seems perfectly straightforward, is to use std::basic_ifstream<BYTE>, which explicitly expresses that "I want an input file stream and I want to use it to read BYTEs":

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::basic_ifstream<BYTE> file(filename, std::ios::binary);

    // read the data:
    return std::vector<BYTE>((std::istreambuf_iterator<BYTE>(file)),
                              std::istreambuf_iterator<BYTE>());
}

but I'm not sure whether basic_ifstream is an appropriate choice in this case.

What is the best way of reading a binary file into the vector? I'd also like to know what's happening "behind the scenes" and what problems I might encounter (apart from the stream not being opened properly, which can be avoided with a simple is_open check).

Is there any good reason why one would prefer to use std::istreambuf_iterator here?
(The only advantage that I can see is simplicity.)

Answer 1

Accepted answer by jww

When testing for performance, I would include a test case for:

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::ifstream file(filename, std::ios::binary);

    // Don't skip whitespace: operator>> (used by istream_iterator) would otherwise drop those bytes
    file.unsetf(std::ios::skipws);

    // get its size:
    std::streampos fileSize;

    file.seekg(0, std::ios::end);
    fileSize = file.tellg();
    file.seekg(0, std::ios::beg);

    // reserve capacity
    std::vector<BYTE> vec;
    vec.reserve(fileSize);

    // read the data:
    vec.insert(vec.begin(),
               std::istream_iterator<BYTE>(file),
               std::istream_iterator<BYTE>());

    return vec;
}

My thinking is that the constructor in Method 1 touches the elements in the vector, and then the read touches each element again.

Method 2 and Method 3 look most promising, but could suffer one or more reallocations as the vector grows. Hence the reason to reserve before reading or inserting.
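
To make the reallocation point concrete, here is a small sketch that counts how often a vector's capacity changes while bytes are appended one at a time, with and without a prior reserve. The helper name countReallocations is hypothetical and only serves as an illustration; the exact growth pattern without reserve depends on the standard library implementation.

```cpp
#include <cstddef>
#include <vector>

typedef unsigned char BYTE;

// Hypothetical helper: appends n bytes one at a time and counts how often
// the vector's capacity changes, i.e. how often it had to reallocate.
std::size_t countReallocations(std::size_t n, bool reserveFirst)
{
    std::vector<BYTE> vec;
    if (reserveFirst)
        vec.reserve(n); // one allocation up front

    std::size_t reallocations = 0;
    std::size_t lastCapacity = vec.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        vec.push_back(static_cast<BYTE>(i)); // value wraps modulo 256
        if (vec.capacity() != lastCapacity) {
            ++reallocations;
            lastCapacity = vec.capacity();
        }
    }
    return reallocations;
}
```

With reserveFirst set to true the count stays at zero, because the capacity is already large enough for all n elements; without it, the vector has to grow several times.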

I would also test with std::copy:

...
std::vector<BYTE> vec;
vec.reserve(fileSize);

std::copy(std::istream_iterator<BYTE>(file),
          std::istream_iterator<BYTE>(),
          std::back_inserter(vec));

In the end, I think the best solution will avoid the operator>> used by istream_iterator (and all the overhead and goodness of operator>> trying to interpret binary data). But I don't know what to use that allows you to copy the data directly into the vector.

Finally, my testing with binary data shows that ios::binary alone is not enough: operator>> still skips whitespace bytes, since binary mode only disables newline translation. Hence the reason for clearing skipws (equivalently, using the std::noskipws manipulator from <ios>).
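
One possible way to sidestep operator>> entirely, sketched here as an assumption rather than as the answer's own method, is to keep the reserve step but copy through std::istreambuf_iterator. It reads from the stream buffer directly, so the skipws flag no longer matters. The function name readFileUnformatted and the choice to return an empty vector on failure are illustrative.

```cpp
#include <fstream>
#include <iterator>
#include <vector>

typedef unsigned char BYTE;

// Sketch: reserve from the file size, then copy raw bytes through
// std::istreambuf_iterator, which bypasses operator>> and its
// whitespace skipping.
std::vector<BYTE> readFileUnformatted(const char* filename)
{
    std::vector<BYTE> vec;
    std::ifstream file(filename, std::ios::binary);
    if (!file)
        return vec; // assumption: an unreadable file yields an empty vector

    // get the size (tellg reports -1 on failure):
    file.seekg(0, std::ios::end);
    const std::streamoff fileSize = file.tellg();
    file.seekg(0, std::ios::beg);
    if (fileSize > 0)
        vec.reserve(static_cast<std::vector<BYTE>::size_type>(fileSize));

    // unformatted copy; each char is converted to unsigned char
    vec.assign(std::istreambuf_iterator<char>(file),
               std::istreambuf_iterator<char>());
    return vec;
}
```

Whether this is faster than the istream_iterator variants is something to measure rather than assume.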

Answer 2

Answered by neoneye

#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ifstream stream("mona-lisa.raw", std::ios::in | std::ios::binary);
    std::vector<uint8_t> contents((std::istreambuf_iterator<char>(stream)), std::istreambuf_iterator<char>());

    for (auto i : contents) {
        int value = i;
        std::cout << "data: " << value << std::endl;
    }

    std::cout << "file size: " << contents.size() << std::endl;
}

Answer 3

Answered by Maxim Egorushkin

Since you are loading the entire file into memory, the optimal approach is to map the file into memory. The kernel loads the file into its page cache anyway, and by mapping the file you simply expose those cached pages to your process. This is also known as zero-copy.

When you use std::vector<>, the data is copied from the kernel page cache into the std::vector<>, which is unnecessary when you just want to read the file.

Also, when you pass two input iterators to std::vector<>, it grows its buffer while reading because it does not know the file size. When you resize the std::vector<> to the file size first, it needlessly zeroes out its contents, which are going to be overwritten with file data anyway. Both methods are sub-optimal in terms of space and time.

Answer 4

Answered by Mats Petersson

I would have thought that the first method, using the size and stream::read(), would be the most efficient. The "cost" of casting to char* is most likely zero: casts of this kind simply tell the compiler "Hey, I know you think this is a different type, but I really want this type here...", and do not add any extra instructions. If you wish to confirm this, try reading the file into a char array and compare the actual assembler code. Aside from a little bit of extra work to figure out the address of the buffer inside the vector, there shouldn't be any difference.
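
For reference, a sketch of the size-plus-read approach with the cast spelled as reinterpret_cast and some basic error handling added. The name readFileChecked and the decision to return an empty vector on failure are assumptions made for illustration, not part of the original question.

```cpp
#include <fstream>
#include <vector>

typedef unsigned char BYTE;

// Sketch: open at the end to learn the size, then read everything in one call.
std::vector<BYTE> readFileChecked(const char* filename)
{
    std::ifstream file(filename, std::ios::binary | std::ios::ate);
    if (!file)
        return std::vector<BYTE>(); // assumption: failure yields an empty vector

    const std::streamoff fileSize = file.tellg();
    if (fileSize <= 0)
        return std::vector<BYTE>(); // empty file or tellg failure
    file.seekg(0, std::ios::beg);

    std::vector<BYTE> fileData(static_cast<std::vector<BYTE>::size_type>(fileSize));

    // the cast only reinterprets the pointer type; no bytes are converted
    file.read(reinterpret_cast<char*>(fileData.data()), fileSize);

    // keep only what was actually read, in case the read came up short
    fileData.resize(static_cast<std::vector<BYTE>::size_type>(file.gcount()));
    return fileData;
}
```

Returning early for an empty file also avoids taking the address of element 0 of an empty vector.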

As always, the only way to tell for sure IN YOUR CASE what is the most efficient is to measure it. "Asking on the internet" is not proof.
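
A very small timing harness along those lines might look like the sketch below. The helper name averageMicroseconds is hypothetical, and a serious benchmark would also need to account for the OS file cache (the first read of a file is usually slower than later ones).

```cpp
#include <chrono>

// Hypothetical helper: runs fn() `repeats` times and returns the average
// wall-clock time per call in microseconds.
template <typename Fn>
double averageMicroseconds(Fn fn, int repeats)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
        fn();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

Each candidate can be wrapped in a lambda, for example averageMicroseconds([] { readFile("data.bin"); }, 100), where data.bin stands in for a file of the size you actually expect to handle.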
