
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/26845538/


Parsing a binary file. What is a modern way?

Tags: c++, casting, binary

Asked by nikitablack

I have a binary file with some layout I know. For example, let the format be like this:


  • 2 bytes (unsigned short) - length of a string
  • 5 bytes (5 x chars) - the string - some id name
  • 4 bytes (unsigned int) - a stride
  • 24 bytes (6 x float - 2 strides of 3 floats each) - float data

The file should look like (I added spaces for readability):


5 hello 3 0.0 0.1 0.2 -0.3 -0.4 -0.5

Here 5 - is 2 bytes: 0x05 0x00. "hello" - 5 bytes and so on.


Now I want to read this file. Currently I do it like this:


  • load the file into an ifstream
  • read this stream into a char buffer[2]
  • cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have the length of the string.
  • read the stream into a vector<char> and create a std::string from this vector. Now I have the string id.
  • the same way, read the next 4 bytes and cast them to unsigned int. Now I have a stride.
  • while not end of file, read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.

This works, but to me it looks ugly. Can I read directly into an unsigned short or float or string etc. without creating a char[x]? If not, what is the correct way to cast (I have read that the style I'm using is old-fashioned)?


P.S.: while writing this question, a clearer formulation came to mind - how do I cast an arbitrary number of bytes from an arbitrary position in a char[x]?


Update: I forgot to mention explicitly that string and float data length is not known at compile time and is variable.


Accepted answer by slaphappy

The C way, which would work fine in C++, would be to declare a struct:


#pragma pack(1)

struct contents {
   // data members;
};

Note that


  • You need to use a pragma to make the compiler align the data as-it-looks in the struct;
  • This technique only works with POD types.

And then cast the read buffer directly into the struct type:


std::vector<char> buf(sizeof(contents));
file.read(buf.data(), buf.size());
contents *stuff = reinterpret_cast<contents *>(buf.data());

Now if your data's size is variable, you can split it into several chunks. To read a single binary object from the buffer, a reader function comes in handy:


template<typename T>
const char *read_object(const char *buffer, T& target) {
    target = *reinterpret_cast<const T*>(buffer);
    return buffer + sizeof(T);
}

The main advantage is that such a reader can be specialized for more advanced C++ objects:


template<typename CT>
const char *read_object(const char *buffer, std::vector<CT>& target) {
    size_t size = target.size();
    CT const *buf_start = reinterpret_cast<const CT*>(buffer);
    std::copy(buf_start, buf_start + size, target.begin());
    return buffer + size * sizeof(CT);
}

And now in your main parser:


int n_floats;
iter = read_object(iter, n_floats);
std::vector<float> my_floats(n_floats);
iter = read_object(iter, my_floats);

Note: as Tony D observed, even if you can get the alignment right via #pragma directives and manual padding (if needed), you may still encounter incompatibility with your processor's alignment, in the form of (best case) performance issues or (worst case) trap signals. This method is probably interesting only if you have control over the file's format.


Answer by fjardon

If it is not for learning purposes, and if you have freedom in choosing the binary format, you'd better consider using something like protobuf, which will handle the serialization for you and allow interoperation with other platforms and languages.


If you cannot use a third-party API, you may look at QDataStream for inspiration.


Answer by Tony Delroy

Currently I do it so:

  • load file to ifstream

  • read this stream to char buffer[2]

  • cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have length of a string.


That last risks a SIGBUS (if your character array happens to start at an odd address and your CPU can only read 16-bit values that are aligned at an even address), performance issues (some CPUs read misaligned values, but more slowly; others, like modern x86s, are fine and fast) and/or endianness issues. I'd suggest reading the two characters, then you can say (x[0] << 8) | x[1] or vice versa, using htons if you need to correct for endianness.


  • read a stream to vector<char> and create a std::string from this vector. Now I have string id.

No need... just read directly into the string:


std::string s(the_size, ' ');

if (input_stream.read(&s[0], s.size()) &&
    input_stream.gcount() == s.size())
    ...use s...
  • the same way read the next 4 bytes and cast them to unsigned int. Now I have a stride. While not end of file, read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.

Better to read the data directly into the unsigned ints and floats, as that way the compiler will ensure correct alignment.


This works, but to me it looks ugly. Can I read directly into an unsigned short or float or string etc. without creating a char[x]? If not, what is the correct way to cast (I have read that the style I'm using is old-fashioned)?


struct Data
{
    uint32_t x;
    float y[6];
};
Data data;
if (input_stream.read((char*)&data, sizeof data) &&
    input_stream.gcount() == sizeof data)
    ...use x and y...

Note the code above avoids reading data into potentially unaligned character arrays: it's unsafe to reinterpret_cast data in a potentially unaligned char array (including inside a std::string) due to alignment issues. Again, you may need some post-read conversion with htonl if there's a chance the file content differs in endianness. If there's an unknown number of floats, you'll need to calculate and allocate sufficient storage with alignment of at least 4 bytes, then aim a Data* at it... it's legal to index past the declared array size of y as long as the memory content at the accessed addresses was part of the allocation and holds a valid float representation read in from the stream. Simpler - but with an additional read, so possibly slower - read the uint32_t first, then new float[n] and do a further read into there.


Practically, this type of approach can work, and a lot of low-level and C code does exactly this. "Cleaner" high-level libraries that might help you read the file must ultimately be doing something similar internally.


Answer by Matthieu M.

I actually implemented a quick-and-dirty binary format parser to read .zip files (following Wikipedia's format description) just last month, and being modern, I decided to use C++ templates.


On some specific platforms, a packed struct could work; however, there are things it does not handle well... such as fields of variable length. With templates, there is no such issue: you can get arbitrarily complex structures (and return types).


A .zip archive is relatively simple, fortunately, so I implemented something simple. Off the top of my head:


using Buffer = std::pair<unsigned char const*, size_t>;

template <typename OffsetReader>
class UInt16LEReader: private OffsetReader {
public:
    UInt16LEReader() {}
    explicit UInt16LEReader(OffsetReader const reader): OffsetReader(reader) {}

    uint16_t read(Buffer const& buffer) const {
        // Note: the original used `or` as the variable name here, but `or`
        // is a reserved alternative token (||) in standard C++.
        OffsetReader const& reader = *this;

        size_t const offset = reader.read(buffer);
        assert(offset <= buffer.second && "Incorrect offset");
        assert(offset + 2 <= buffer.second && "Too short buffer");

        unsigned char const* begin = buffer.first + offset;

        // http://commandcenter.blogspot.fr/2012/04/byte-order-fallacy.html
        return (uint16_t(begin[0]) << 0)
             + (uint16_t(begin[1]) << 8);
    }
}; // class UInt16LEReader

// Declined for UInt[8|16|32][LE|BE]...

Of course, the basic OffsetReader actually has a constant result:


template <size_t O>
class FixedOffsetReader {
public:
    size_t read(Buffer const&) const { return O; }
}; // class FixedOffsetReader

and since we are talking templates, you can switch the types at leisure (you could implement a proxy reader which delegates all reads to a shared_ptr which memoizes them).


What is interesting, though, is the end-result:


// http://en.wikipedia.org/wiki/Zip_%28file_format%29#File_headers
class LocalFileHeader {
public:
    template <size_t O>
    using UInt32 = UInt32LEReader<FixedOffsetReader<O>>;
    template <size_t O>
    using UInt16 = UInt16LEReader<FixedOffsetReader<O>>;

    UInt32< 0> signature;
    UInt16< 4> versionNeededToExtract;
    UInt16< 6> generalPurposeBitFlag;
    UInt16< 8> compressionMethod;
    UInt16<10> fileLastModificationTime;
    UInt16<12> fileLastModificationDate;
    UInt32<14> crc32;
    UInt32<18> compressedSize;
    UInt32<22> uncompressedSize;

    using FileNameLength = UInt16<26>;
    using ExtraFieldLength = UInt16<28>;

    using FileName = StringReader<FixedOffsetReader<30>, FileNameLength>;

    using ExtraField = StringReader<
        CombinedAdd<FixedOffsetReader<30>, FileNameLength>,
        ExtraFieldLength
    >;

    FileName filename;
    ExtraField extraField;
}; // class LocalFileHeader

This is rather simplistic, obviously, but incredibly flexible at the same time.


An obvious axis of improvement would be to improve chaining, since here there is a risk of accidental overlaps. My archive-reading code worked the first time I tried it, though, which was evidence enough for me that this code was sufficient for the task at hand.


Answer by Gene

I had to solve this problem once. The data files were packed FORTRAN output. Alignments were all wrong. I succeeded with preprocessor tricks that did automatically what you are doing manually: unpack the raw data from a byte buffer to a struct. The idea is to describe the data in an include file:


BEGIN_STRUCT(foo)
    UNSIGNED_SHORT(length)
    STRING_FIELD(length, label)
    UNSIGNED_INT(stride)
    FLOAT_ARRAY(3 * stride)
END_STRUCT(foo)

Now you can define these macros to generate the code you need, say the struct declaration, include the above, undef and define the macros again to generate unpacking functions, followed by another include, etc.


NB I first saw this technique used in gcc for abstract syntax tree-related code generation.


If CPP is not powerful enough (or such preprocessor abuse is not for you), substitute a small lex/yacc program (or pick your favorite tool).


It's amazing to me how often it pays to think in terms of generating code rather than writing it by hand, at least in low-level foundation code like this.


Answer by Barry

Since all of your data is variable, you can read the two blocks separately and still use casting:


struct id_contents
{
    uint16_t len;
    char id[];
} __attribute__((packed)); // assuming gcc, ymmv

struct data_contents
{
    uint32_t stride;
    float data[];
} __attribute__((packed)); // assuming gcc, ymmv

class my_row
{
    const id_contents* id_;
    const data_contents* data_;
    size_t size_;

public:
    my_row(const char* buffer) {
        id_= reinterpret_cast<const id_contents*>(buffer);
        size_ = sizeof(*id_) + id_->len;
        data_ = reinterpret_cast<const data_contents*>(buffer + size_);
        size_ += sizeof(*data_) + 
            data_->stride * sizeof(float); // or however many, 3*float?

    }

    size_t size() const { return size_; }
};

That way you can use Mr. kbok's answer to parse correctly:


const char* buffer = getPointerToDataSomehow();

my_row data1(buffer);
buffer += data1.size();

my_row data2(buffer);
buffer += data2.size();

// etc.

Answer by Ajay

You should rather declare a structure (with 1-byte packing - how depends on the compiler). Write using that structure, and read using the same structure. Put only POD types in the structure, hence no std::string etc. Use this structure only for file I/O or other inter-process communication - use a normal struct or class to hold it for further use in the C++ program.


Answer by rev

I personally do it this way:


// some code which loads the file in memory
#pragma pack(push, 1)
struct someFile { int a, b, c; char d[0xEF]; };
#pragma pack(pop)

someFile* f = (someFile*) (file_in_memory);
int filePropertyA = f->a;

A very effective way to handle fixed-size structs at the start of the file.


Answer by Átila Neves

Use a serialization library. Here are a few:


Answer by Dmitry Ponyatov

I use the ragel tool to generate pure C procedural source code (no tables) for microcontrollers with 1-2K of RAM. It does not use any file I/O or buffering, and produces both easy-to-debug code and a .dot/.pdf file with a state-machine diagram.


ragel can also output Go, Java, etc. code for parsing, but I did not use these features.


The key feature of ragel is the ability to parse any byte-oriented data, but you can't dig into bit fields. Another limitation is that ragel can parse regular structures, but has no recursion or syntax-grammar parsing.
