What XML parser should I use in C++?

Notice: This page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/9387610/

Time: 2020-08-27 12:46:32  Source: igfitidea

Tags: c++, xml-parsing, c++-faq

Asked by Nicol Bolas

I have XML documents that I need to parse and/or I need to build XML documents and write them to text (either files or memory). Since the C++ standard library does not have a library for this, what should I use?

Note:This is intended to be a definitive, C++-FAQ-style question for this. So yes, it is a duplicate of others. I did not simply appropriate those other questions because they tended to ask for something slightly more specific. This question is more generic.

Answered by Nicol Bolas

Just like with standard library containers, what library you should use depends on your needs. Here's a convenient flowchart:

[Figure: flowchart for choosing a C++ XML library]

So the first question is this: What do you need?

I Need Full XML Compliance

OK, so you need to process XML. Not toy XML, real XML. You need to be able to read and write all of the XML specification, not just the low-lying, easy-to-parse bits. You need Namespaces, DocTypes, entity substitution, the works. The W3C XML Specification, in its entirety.

The next question is: Does your API need to conform to DOM or SAX?

I Need Exact DOM and/or SAX Conformance

OK, so you really need the API to be DOM and/or SAX. It can't just be a SAX-style push parser, or a DOM-style retained parser. It must be the actual DOM or the actual SAX, to the extent that C++ allows.

You have chosen:

Xerces

That's your choice. It's pretty much the only C++ XML parser/writer that has full (or as near as C++ allows) DOM and SAX conformance. It also has XInclude support, XML Schema support, and a plethora of other features.

It has no real dependencies. It uses the Apache license.

I Don't Care About DOM and/or SAX Conformance

You have chosen:

LibXML2

LibXML2 offers a C-style interface (if that really bothers you, go use Xerces), though the interface is at least somewhat object-based and easily wrapped. It provides a lot of features, like XInclude support (with callbacks so that you can tell it where it gets the file from), an XPath 1.0 recognizer, RelaxNG and Schematron support (though the error messages leave a lot to be desired), and so forth.

It does have a dependency on iconv, but it can be configured without that dependency. Though that does mean that you'll have a more limited set of possible text encodings it can parse.

It uses the MIT license.

I Do Not Need Full XML Compliance

OK, so full XML compliance doesn't matter to you. Your XML documents are either fully under your control or are guaranteed to use the "basic subset" of XML: no namespaces, entities, etc.

So what does matter to you? The next question is: What is the most important thing to you in your XML work?

Maximum XML Parsing Performance

Your application needs to take XML and turn it into C++ data structures as fast as this conversion can possibly happen.

You have chosen:

RapidXML

This XML parser is exactly what it says on the tin: rapid XML. It doesn't even deal with pulling the file into memory; how that happens is up to you. What it does deal with is parsing that into a series of C++ data structures that you can access. And it does this about as fast as it takes to scan the file byte by byte.

Of course, there's no such thing as a free lunch. Like most XML parsers that don't care about the XML specification, RapidXML doesn't touch namespaces, DocTypes, entities (with the exception of character entities and the five predefined XML ones), and so forth. So basically nodes, elements, attributes, and such.
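
To make "character entities and the predefined XML ones" concrete, here is a minimal sketch of that level of entity handling. It is not RapidXML's code; the function name decode_basic_entities is made up for illustration, and the sketch assumes ASCII-only numeric references so that it stays short.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Decodes the five predefined XML entities (&lt; &gt; &amp; &apos; &quot;)
// and numeric character references in the ASCII range. Anything else
// (DTD-declared entities, non-ASCII references, stray '&') is copied through
// unchanged, which is roughly what a "basic subset" parser gives you.
// Hypothetical helper for illustration; not part of any library.
std::string decode_basic_entities(std::string_view in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '&') { out += in[i]; continue; }
        const std::size_t end = in.find(';', i);
        if (end == std::string_view::npos) { out += in[i]; continue; }

        const std::string_view name = in.substr(i + 1, end - i - 1);
        char decoded = 0;
        if (name == "lt") decoded = '<';
        else if (name == "gt") decoded = '>';
        else if (name == "amp") decoded = '&';
        else if (name == "apos") decoded = '\'';
        else if (name == "quot") decoded = '"';
        else if (name.size() > 1 && name[0] == '#') {
            // Numeric reference: &#65; (decimal) or &#x41; (hexadecimal).
            const bool hex = (name[1] == 'x');
            const std::size_t start = hex ? 2 : 1;
            unsigned code = 0;
            bool ok = name.size() > start;
            for (std::size_t k = start; ok && k < name.size(); ++k) {
                const char c = name[k];
                unsigned digit = 0;
                if (c >= '0' && c <= '9') digit = c - '0';
                else if (hex && c >= 'a' && c <= 'f') digit = 10 + (c - 'a');
                else if (hex && c >= 'A' && c <= 'F') digit = 10 + (c - 'A');
                else { ok = false; break; }
                code = code * (hex ? 16 : 10) + digit;
                if (code > 127) ok = false;  // sketch handles ASCII only
            }
            if (ok && code != 0) decoded = static_cast<char>(code);
        }

        if (decoded == 0) { out += in[i]; continue; }  // unknown: leave as-is
        out += decoded;
        i = end;  // skip past the ';'
    }
    return out;
}
```

A document that declares its own entities in a DTD would come out of this sketch with those references untouched, which is the kind of gap the compliance question above is about.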

Also, it is a DOM-style parser. So it does require that you read all of the text in. However, what it doesn't do is copy any of that text (usually). The way RapidXML gets most of its speed is by referring to strings in-place. This requires more memory management on your part (you must keep that string alive while RapidXML is looking at it).
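
The lifetime rule can be illustrated without RapidXML at all. The sketch below is a toy scanner, not RapidXML's API: attribute_in_place is a hypothetical function that returns a std::string_view pointing into the caller's buffer, so the result is only valid while that buffer is alive. (RapidXML itself works on a writable, zero-terminated char buffer; std::string_view is used here only to show the constraint.)

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns a view of the value of the first attribute called `name`, or an
// empty view if there is none. The view points INTO `buffer`; no text is
// copied, so `buffer` must outlive the returned view.
// Toy scanner: assumes double-quoted values and does no XML validation.
std::string_view attribute_in_place(std::string_view buffer, std::string_view name)
{
    std::string pattern(name);
    pattern += "=\"";

    std::size_t pos = 0;
    while ((pos = buffer.find(pattern, pos)) != std::string_view::npos) {
        // Only accept a match that starts an attribute name (after whitespace),
        // so that looking for "id" does not match inside "uuid".
        const bool at_boundary =
            pos > 0 && (buffer[pos - 1] == ' ' || buffer[pos - 1] == '\t' ||
                        buffer[pos - 1] == '\n');
        const std::size_t begin = pos + pattern.size();
        const std::size_t end = buffer.find('"', begin);
        if (at_boundary && end != std::string_view::npos)
            return buffer.substr(begin, end - begin);
        pos = begin;
    }
    return {};
}
```

If the std::string holding the document is destroyed or reallocated while such a view is still in use, the view dangles. That is the extra memory management the paragraph above refers to.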

RapidXML's DOM is bare-bones. You can get string values for things. You can search for attributes by name. That's about it. There are no convenience functions to turn attributes into other values (numbers, dates, etc). You just get strings.

One other downside with RapidXML is that it is painful for writing XML. It requires you to do a lot of explicit memory allocation of string names in order to build its DOM. It does provide a kind of string buffer, but that still requires a lot of explicit work on your end. It's certainly functional, but it's a pain to use.

It uses the MIT licence. It is a header-only library with no dependencies.

I Care About Performance But Not Quite That Much

Yes, performance matters to you. But maybe you need something a bit less bare-bones. Maybe something that can handle more Unicode, or doesn't require so much user-controlled memory management. Performance is still important, but you want something a little less direct.

You have chosen:

PugiXML

Historically, this served as inspiration for RapidXML. But the two projects have diverged, with Pugi offering more features, while RapidXML is focused entirely on speed.

PugiXML offers Unicode conversion support, so if you have some UTF-16 docs around and want to read them as UTF-8, Pugi will provide. It even has an XPath 1.0 implementation, if you need that sort of thing.

But Pugi is still quite fast. Like RapidXML, it has no dependencies and is distributed under the MIT License.

Reading Huge Documents

You need to read documents that are measured in the gigabytes in size. Maybe you're getting them from stdin, being fed by some other process. Or you're reading them from massive files. Or whatever. The point is, what you need is to not have to read the entire file into memory all at once in order to process it.

You have chosen:

LibXML2

Xerces's SAX-style API will work in this capacity, but LibXML2 is here because it's a bit easier to work with. A SAX-style API is a push-API: it starts parsing a stream and just fires off events that you have to catch. You are forced to manage context, state, and so forth. Code that reads a SAX-style API is a lot more spread out than one might hope.

LibXML2's xmlReader object (the xmlTextReader interface) is a pull-API. You ask to go to the next XML node or element; you aren't told. This allows you to store context as you see fit, to handle different entities in a way that's much more readable in code than a bunch of callbacks.
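
The difference is easier to see in code. What follows is a toy pull reader written from scratch for illustration; it is not LibXML2's API, and every name in it (ToyPullReader, Event, collect_titles) is hypothetical. It understands only start tags, end tags, and text, but it shows the shape of a pull loop: the caller asks for the next event and keeps its context in ordinary local variables.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

enum class Event { StartElement, EndElement, Text, EndOfDocument };

// Toy pull reader over an in-memory string. Handles <name>, </name> and text
// only: no attributes, comments, entities, or error reporting.
class ToyPullReader {
public:
    explicit ToyPullReader(std::string_view input) : input_(input) {}

    // Advances to the next event. Nothing happens until the caller asks.
    Event next()
    {
        if (pos_ >= input_.size()) return Event::EndOfDocument;
        if (input_[pos_] == '<') {
            const std::size_t close = input_.find('>', pos_);
            if (close == std::string_view::npos) {
                pos_ = input_.size();
                return Event::EndOfDocument;
            }
            const bool is_end = (close > pos_ + 1 && input_[pos_ + 1] == '/');
            const std::size_t name_begin = pos_ + (is_end ? 2 : 1);
            value_ = input_.substr(name_begin, close - name_begin);
            pos_ = close + 1;
            return is_end ? Event::EndElement : Event::StartElement;
        }
        const std::size_t lt = input_.find('<', pos_);
        const std::size_t stop = (lt == std::string_view::npos) ? input_.size() : lt;
        value_ = input_.substr(pos_, stop - pos_);
        pos_ = stop;
        return Event::Text;
    }

    // Element name for start/end events, character data for text events.
    std::string_view value() const { return value_; }

private:
    std::string_view input_;
    std::size_t pos_ = 0;
    std::string_view value_;
};

// Caller-side code: the "am I inside <title>?" state is a plain local
// variable, not something spread across several callbacks.
std::vector<std::string> collect_titles(std::string_view xml)
{
    std::vector<std::string> titles;
    ToyPullReader reader(xml);
    bool in_title = false;
    for (Event e = reader.next(); e != Event::EndOfDocument; e = reader.next()) {
        if (e == Event::StartElement) in_title = (reader.value() == "title");
        else if (e == Event::EndElement) in_title = false;
        else if (in_title) titles.emplace_back(reader.value());
    }
    return titles;
}
```

With a push (SAX-style) API, the in_title flag would have to live in an object shared by separate start-element, end-element, and character-data handlers.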

Alternatives

Expat

Expat is a well-known XML parser written in C. It is stream-oriented: you register handler callbacks and it invokes them as it reads the input. It was written by James Clark.

Its current status is active. The most recent version is 2.2.9, which was released on 2019-09-25.

LlamaXML

It is an implementation of a StAX-style API. It is a pull-parser, similar to LibXML2's xmlReader parser.

But it hasn't been updated since 2005. So again, Caveat Emptor.

XPath Support

XPath is a system for querying elements within an XML tree. It's a handy way of effectively naming an element or collection of elements by common properties, using a standardized syntax. Many XML libraries offer XPath support.

There are effectively three choices here:

  • LibXML2: It provides full XPath 1.0 support. Again, it is a C API, so if that bothers you, there are alternatives.
  • PugiXML: It comes with XPath 1.0 support as well. As above, it's more of a C++ API than LibXML2, so you may be more comfortable with it.
  • TinyXML: It does not come with XPath support, but there is the TinyXPath library that provides it. TinyXML is undergoing a conversion to version 2.0, which significantly changes the API, so TinyXPath may not work with the new API. Like TinyXML itself, TinyXPath is distributed under the zLib license.

Just Get The Job Done

So, you don't care about XML correctness. Performance isn't an issue for you. Streaming is irrelevant. All you want is something that gets XML into memory and allows you to stick it back onto disk again. What you care about is the API.

You want an XML parser that's going to be small, easy to install, trivial to use, and small enough to be irrelevant to your eventual executable's size.

You have chosen:

TinyXML

I put TinyXML in this slot because it is about as braindead simple to use as XML parsers get. Yes, it's slow, but it's simple and obvious. It has a lot of convenience functions for converting attributes and so forth.

Writing XML is no problem in TinyXML. You just new up some objects, attach them together, send the document to a std::ostream, and everyone's happy.

There is also something of an ecosystem built around TinyXML, with a more iterator-friendly API, and even an XPath 1.0 implementation layered on top of it.

TinyXML uses the zLib license, which is more or less the MIT License with a different name.

Answered by Boris Kolpackov

There is another approach to handling XML that you may want to consider, called XML data binding, especially if you already have a formal specification of your XML vocabulary, for example, in XML Schema.

XML data binding allows you to use XML without actually doing any XML parsing or serialization. A data binding compiler auto-generates all the low-level code and presents the parsed data as C++ classes that correspond to your application domain. You then work with this data by calling functions, and working with C++ types (int, double, etc) instead of comparing strings and parsing text (which is what you do with low-level XML access APIs such as DOM or SAX).
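
As a rough picture of what this looks like from the application's side, here is a hand-written sketch. It is not output from any real data binding compiler: RawElement, Person, and bind_person are hypothetical names, and RawElement merely stands in for whatever string-based view a low-level parser would provide.

```cpp
#include <map>
#include <string>
#include <utility>

// Stand-in for low-level XML access: everything is a string.
struct RawElement {
    std::string name;
    std::map<std::string, std::string> attributes;
};

// The kind of class a binding compiler might generate from a schema that
// says "person has a string name and an integer age". Hand-written here.
class Person {
public:
    Person(std::string name, int age) : name_(std::move(name)), age_(age) {}
    const std::string& name() const { return name_; }
    int age() const { return age_; }

private:
    std::string name_;
    int age_;
};

// The glue that would normally be generated: string lookups and text-to-int
// conversion happen once, here, instead of throughout the application.
// Throws std::out_of_range or std::invalid_argument on missing or bad input.
Person bind_person(const RawElement& element)
{
    return Person(element.attributes.at("name"),
                  std::stoi(element.attributes.at("age")));
}
```

Application code then calls person.age() and gets an int, rather than looking up an "age" attribute and parsing its text at every use.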

See, for example, an open-source XML data binding implementation that I wrote, CodeSynthesis XSD, and, for a lighter-weight, dependency-free version, CodeSynthesis XSD/e.

Answered by breakpoint

One other note about Expat: it's worth looking at for embedded systems work. However, the documentation you are likely to find on the web is ancient and wrong. The source code actually has fairly thorough function-level comments, but it will take some perusing for them to make sense.

Answered by Victor Gubin

OK then. I've created a new one, since none of the libraries in the list satisfied my needs.

Benefits:

  1. Pull-parser streaming API at the low level (similar to Java StAX)
  2. Exception and RTTI modes are supported
  3. Limit on memory usage, support for large files (tested with a 100 MiB XMark file; speed depends on hardware)
  4. Unicode support, with auto-detection of the input source encoding
  5. High-level API for reading into structures/POCO
  6. Meta-programming API for writing, and for generating XSD from structures/POCO, with support for XML structure (attributes and nested tags). XSD generation needs RTTI, but it can be done once, in a debug build only.
  7. C++11: GCC and VC++ 15+

Disadvantages:

  1. DTD and XSD validation not yet provided
  2. Obtaining XML/XSD by HTTP/HTTPS in progress, not yet done
  3. New library

Project home

Answered by Michael Chourdakis

Answered by Michael Haephrati

At Secured Globe, Inc., we use rapidxml. We tried all the others, but rapidxml seems to be the best choice for us.

Here is an example:

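    #include <cstdlib>
    #include <cwchar>
    #include "rapidxml.hpp"

    // Helper used by the original snippet but not shown there; this version is
    // an assumption: it looks up a direct child element by name.
    static bool GetNodeByElementName(rapidxml::xml_node<char>* parent, const char* name,
                                     rapidxml::xml_node<char>** out)
    {
        *out = parent ? parent->first_node(name) : 0;
        return *out != 0;
    }

    // GetDefaultAccount is a hypothetical wrapper name. xmlData must be a
    // malloc'ed, zero-terminated, writable buffer (freed here); result must
    // have room for at least 100 wide characters.
    bool GetDefaultAccount(char* xmlData, wchar_t* result)
    {
        rapidxml::xml_document<char> doc;
        doc.parse<0>(xmlData);  // parses in place; nodes point into xmlData
        rapidxml::xml_node<char>* root = doc.first_node();

        rapidxml::xml_node<char>* node_account = 0;
        rapidxml::xml_node<char>* node_default = 0;
        const bool found = GetNodeByElementName(root, "Account", &node_account)
                        && GetNodeByElementName(node_account, "default", &node_default);
        if (found)
            swprintf(result, 100, L"%hs", node_default->value());  // copy before freeing
        free(xmlData);
        return found;
    }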