java: Why is SAX parsing faster than DOM parsing? How does StAX work?

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must comply with the same license, cite the original URL and authors, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/3825206/

Time: 2020-10-30 03:33:13  Source: igfitidea

why is sax parsing faster than dom parsing ? and how does stax work?

java xml dom sax stax

Asked by andersonbd1

somewhat related to: libxml2 from java

yes, this question is rather long-winded - sorry. I kept it as dense as I felt possible. I bolded the questions to make it easier to peek at before reading the whole thing.

Why is sax parsing faster than dom parsing? The only thing I can come up with is that w/ sax you're probably ignoring the majority of the incoming data, and thus not wasting time processing parts of the xml you don't care about. IOW - after parsing w/ SAX, you can't recreate the original input. If you wrote your SAX parser so that it accounted for each and every xml node (and could thus recreate the original), then it wouldn't be any faster than DOM, would it?

The reason I'm asking is that I'm trying to parse xml documents more quickly. I need to have access to the entire xml tree AFTER parsing. I am writing a platform for 3rd party services to plug into, so I can't anticipate which parts of the xml document will be needed and which parts won't. I don't even know the structure of the incoming document. This is why I can't use jaxb or sax. Memory footprint isn't an issue for me because the xml documents are small and I only need 1 in memory at a time. It's the time it takes to parse this relatively small xml document that is killing me. I haven't used stax before, but perhaps I need to investigate further because it might be the middle ground? If I understand correctly, stax keeps the original xml structure and processes the parts that I ask for on demand? In this way, the original parse time might be quick, but each time I ask it to traverse part of the tree it hasn't yet traversed, that's when the processing takes place?

If you provide a link that answers most of the questions, I will accept your answer (you don't have to directly answer my questions if they're already answered elsewhere).

update: I rewrote it in sax and it parses documents in 2.1 ms on average. This is an improvement (16% faster) over the 2.5 ms that dom was taking; however, it is not the magnitude that I (et al) would've guessed.

Thanks

Answered by bdoughan

Assuming you do nothing but parse the document, the ranking of the different parser standards is as follows:

1. StAX is the fastest

  • The event is reported to you

2. SAX is next

  • It does everything StAX does plus the content is realized automatically (element name, namespace, attributes, ...)

3. DOM is last

  • It does everything SAX does and presents the information as an instance of Node.

Your Use Case

  • If you need to maintain all of the XML, DOM is the standard representation. It integrates cleanly with XSLT transforms (javax.xml.transform), XPath (javax.xml.xpath), and schema validation (javax.xml.validation) APIs. However if performance is key, you may be able to build your own tree structure using StAX faster than a DOM parser could build a DOM.
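The "build your own tree structure using StAX" idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: `StaxTreeBuilder` and `SimpleNode` are hypothetical names, the node type keeps just element names, attributes, text and children, and it ignores namespaces, comments and processing instructions. Whether it actually beats a DOM parser for a given document would have to be measured.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxTreeBuilder {

    /** Hypothetical lightweight node; not part of any standard API. */
    public static final class SimpleNode {
        public final String name;
        public final Map<String, String> attributes = new LinkedHashMap<>();
        public final List<SimpleNode> children = new ArrayList<>();
        public final StringBuilder text = new StringBuilder();

        SimpleNode(String name) {
            this.name = name;
        }
    }

    /** Pulls StAX events and assembles them into a SimpleNode tree. */
    public static SimpleNode parse(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        Deque<SimpleNode> stack = new ArrayDeque<>(); // currently open elements
        SimpleNode root = null;
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        SimpleNode node = new SimpleNode(reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            node.attributes.put(reader.getAttributeLocalName(i),
                                    reader.getAttributeValue(i));
                        }
                        if (stack.isEmpty()) {
                            root = node;
                        } else {
                            stack.peek().children.add(node);
                        }
                        stack.push(node);
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        // Text is attached to the innermost open element.
                        if (!stack.isEmpty()) {
                            stack.peek().text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        stack.pop();
                        break;
                    default:
                        break; // comments, PIs, etc. are skipped in this sketch
                }
            }
        } finally {
            reader.close();
        }
        return root;
    }
}
```

A call such as `StaxTreeBuilder.parse("<a x=\"1\"><b>hi</b></a>")` would return a root node named `a` with one child `b`. The trade-off is that such a tree does not plug into `javax.xml.xpath` or `javax.xml.transform` the way a DOM does.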

Answered by mikerobi

DOM parsing requires you to load the entire document into memory and then traverse a tree to find the information you want.

SAX only requires as much memory as you need to do basic IO, and you can extract the information that you need as the document is being read. Because SAX is stream oriented, you can even process a file which is still being written by another process.
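To make "extract the information that you need as the document is being read" concrete, here is a minimal sketch of a SAX handler that collects the text of every `<title>` element from a stream without building a tree. The class name `SaxTitleExtractor` and the `title` element are assumptions chosen for illustration, and the sketch assumes `<title>` elements are not nested.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTitleExtractor {

    /** Collects the text of each <title> element while the stream is being read. */
    public static List<String> extractTitles(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        List<String> titles = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            private StringBuilder current; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                if ("title".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text run, so append.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    titles.add(current.toString());
                    current = null;
                }
            }
        });
        return titles;
    }
}
```

Everything outside `<title>` is seen by the handler and then dropped, so memory use stays roughly independent of document size.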

Answered by erickson

SAX is faster because DOM parsers often use a SAX parser to parse a document internally, then do the extra work of creating and manipulating objects to represent each and every node, even if the application doesn't care about them.
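The "extra work" described above can be illustrated with a hand-written SAX handler that creates a DOM node for every event. This is a simplified sketch under assumptions (the class name `SaxToDom` is hypothetical, and namespaces, comments and processing instructions are ignored); real DOM parsers are more elaborate, but the per-node object creation is the same kind of cost.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxToDom {

    /** Parses with SAX and allocates one DOM node per element and text run. */
    public static Document build(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Deque<Node> stack = new ArrayDeque<>(); // open nodes, innermost on top
        stack.push(doc);

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                Element element = doc.createElement(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    element.setAttribute(atts.getQName(i), atts.getValue(i));
                }
                stack.peek().appendChild(element);
                stack.push(element);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text nodes cannot be direct children of the Document node.
                if (stack.peek() != doc) {
                    stack.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                stack.pop();
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return doc;
    }
}
```

An application that only needs a few values can skip all of the `createElement`, `createTextNode` and `appendChild` calls by handling the same events directly.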

An application that uses SAX directly is likely to utilize the information set more efficiently than a DOM "parser" does.

StAX is a happy medium where an application gets a more convenient API than SAX's event-driven approach, yet doesn't suffer the inefficiency of creating a complete DOM.
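As a rough illustration of the "more convenient API" point: with StAX the application drives the loop and pulls events, so it can stop as soon as it has what it needs. The sketch below uses the cursor API (`XMLStreamReader`); the class and method names are hypothetical. Note that `getElementText()` throws if the matched element contains child elements, so this assumes text-only elements.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxFirstMatch {

    /**
     * Returns the text of the first element with the given local name,
     * or null if no such element exists. Parsing stops at the first match.
     */
    public static String firstText(String xml, String elementName) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next(); // the caller asks for the next event
                if (event == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

With SAX the same early exit usually means throwing an exception out of the handler, because the parser, not the application, owns the loop.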

Answered by Buhake Sindi

SAX is faster than DOM (usually felt when reading large XML documents) because SAX gives you information as a sequence of events (usually accessed through a handler), while DOM creates Nodes and manages the node structure until a DOM tree is fully created (as represented in the XML document).

For relatively small files, you won't feel the effect (except possibly the extra processing DOM does to create Node elements and/or Node lists).

I can't really comment on StAX since I've never played with it.
