Is there an XPath processor for Java that works with the SAX model?

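Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1863250/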
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

Time: 2020-08-12 23:22:35  Source: igfitidea

Is there any XPath processor for SAX model?

Tags: java, xml, xpath, sax

Asked by user189603

I'm looking for an XPath evaluator that doesn't build the whole DOM document in order to find nodes. The objective is to handle a large amount of XML data (ideally over 2 GB) with the SAX model, which is very good for memory management, while still being able to search for nodes.

Thank you all for the support!

For all those who say it's not possible: after asking the question, I found a project named "saxpath" (http://www.saxpath.org/), but I can't find any project that implements it.

Answered by Felix Kling

Hmm, I'm not sure I really understand you. As far as I know, the SAX model is event-oriented: you do something when a certain node is encountered during parsing. Yes, it is better for memory, but I don't see how you would get XPath into it. As SAX does not build a model, I don't think that this is possible.

Answered by ptriller

I don't think XPath works with SAX, but you might take a look at StAX, which is a streaming (pull-based) XML API for Java.

http://en.wikipedia.org/wiki/StAX
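To make the StAX suggestion concrete, here is a minimal sketch of how a plain child-axis path could be matched while pulling events from the stream. It is not an XPath engine: the class name StaxPathDemo and the method textAt are made up for illustration, and only absolute paths such as /catalog/item/name that end at text-only elements are handled.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: matches one absolute, child-steps-only path with StAX.
class StaxPathDemo {

    // Returns the text of every element whose absolute path equals targetPath.
    // Assumes the matched elements contain text only (getElementText requires that).
    static List<String> textAt(String xml, String targetPath) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Deque<String> open = new ArrayDeque<>(); // names of currently open elements, root first
        List<String> hits = new ArrayList<>();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                open.addLast(reader.getLocalName());
                if (("/" + String.join("/", open)).equals(targetPath)) {
                    // getElementText reads up to the matching END_ELEMENT,
                    // so the element is popped here rather than in the branch below.
                    hits.add(reader.getElementText());
                    open.removeLast();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                open.removeLast();
            }
        }
        reader.close();
        return hits;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<catalog><item><name>A</name></item><item><name>B</name></item></catalog>";
        System.out.println(textAt(xml, "/catalog/item/name"));
    }
}
```

Only the stack of open element names is kept in memory, which is why this approach scales to large inputs but cannot answer queries that look backwards in the document.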

Answered by Carl Smotricz

What you could do is hook an XSL transformer to a SAX input source. Your processing will be sequential, and the XSL processor will try to handle the input as it arrives and turn it into whatever result you specified. You can use this to pull a path's value out of the stream. This would come in especially handy if you wanted to produce a bunch of different XPath results in one pass.

You'll (typically) get an XML document as a result, but you could pull your expected output out of, say, a StreamResult without too much hassle.
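A minimal sketch of this idea using only the JDK's javax.xml.transform API is shown below. The stylesheet, the /orders/order document shape, and the class name XsltExtractDemo are assumptions made up for the example. Note also that whether the input is really processed without building an internal tree depends on the XSLT processor; the one bundled with the JDK is generally understood to build its own internal model, so this shows the wiring rather than a memory guarantee.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;

// Hypothetical example: several XPath results produced by one XSLT pass.
class XsltExtractDemo {

    // Illustrative stylesheet: prints each /orders/order/@id on its own line,
    // followed by the number of orders. Text output avoids an XML declaration.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='/orders/order'>"
      + "<xsl:value-of select='@id'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "<xsl:text>count=</xsl:text><xsl:value-of select='count(/orders/order)'/>"
      + "</xsl:template></xsl:stylesheet>";

    static String extract(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        // The input is handed over as a SAX source; the result is collected as plain text.
        transformer.transform(new SAXSource(new InputSource(new StringReader(xml))),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(extract("<orders><order id='1'/><order id='2'/></orders>"));
    }
}
```

For a real 2 GB input, the InputSource would wrap a file stream instead of a StringReader, and the StreamResult could write directly to a file.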

Answered by skaffman

The standard javax XPath API technically already works with streams; javax.xml.xpath.XPathExpression can be evaluated against an InputSource, which in turn can be constructed with a Reader. I don't think it constructs a DOM under the covers.
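For reference, the call shape being described looks roughly like the sketch below. The /library/book document and the class name XPathOnInputSourceDemo are made up for the example. One caveat: the default JDK implementation is generally understood to parse the InputSource into an in-memory tree before evaluating, so this illustrates the API rather than a streaming guarantee.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical example: evaluating compiled XPath expressions against an InputSource.
class XPathOnInputSourceDemo {

    // String value of the first book title in the document.
    static String firstTitle(String xml) throws XPathExpressionException {
        XPathExpression expr = XPathFactory.newInstance().newXPath()
                .compile("/library/book[1]/title");
        return expr.evaluate(new InputSource(new StringReader(xml)));
    }

    // Number of book elements, requested as an XPath number.
    static int countBooks(String xml) throws XPathExpressionException {
        XPathExpression expr = XPathFactory.newInstance().newXPath()
                .compile("count(/library/book)");
        Double n = (Double) expr.evaluate(new InputSource(new StringReader(xml)),
                                          XPathConstants.NUMBER);
        return n.intValue();
    }

    public static void main(String[] args) throws XPathExpressionException {
        String xml = "<library><book><title>T1</title></book><book><title>T2</title></book></library>";
        System.out.println(firstTitle(xml) + " / " + countBooks(xml));
    }
}
```

Each evaluate call consumes its InputSource, so a fresh one is created per expression.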

Answered by Pavel Minaev

SAX is forward-only, while XPath queries can navigate the document in any direction (consider the parent::, ancestor::, preceding:: and preceding-sibling:: axes). I don't see how this would be possible in general. The best approximation would be some sort of lazy-loading DOM, but depending on your queries this may or may not give you any benefit: there is always a worst-case query such as //*[. != preceding::*].

Answered by vtd-xml-author

There are SAX/StAX-based XPath implementations, but they support only a small subset of XPath expressions and axes, largely because SAX and StAX are forward-only. The best alternative I am aware of is extended VTD-XML: it supports full XPath, partial document loading via memory mapping, and a maximum document size of 256 GB, but you will need a 64-bit JVM to use it to its full potential.

Answered by Thorbjørn Ravn Andersen

Have a look at the streaming mode of the Saxon-SA XSLT processor.

http://www.saxonica.com/documentation/sourcedocs/serial.html

"The rules that determine whether a path expression can be streamed are:

  • The expression to be streamed starts with a call on the document() or doc() function.
  • The path expression introduced by the call on doc() or document must conform to a subset of XPath defined as follows:

  • any XPath expression is acceptable if it conforms to the rules for path expressions appearing in identity constraints in XML Schema. These rules allow no predicates; the first step (but only the first) can be introduced with "//"; the last step can optionally use the attribute axis; all other steps must be simple Axis Steps using the child axis.

  • In addition, Saxon allows the expression to contain a union, for example doc()/(*/ABC | /XYZ). Unions can also be expressed in abbreviated form, for example the above can be written as doc()//(ABC|XYZ).
  • The expression must either select elements only, or attributes only, or a mixture of elements and attributes.

  • Simple filters (one or more) are also supported. Each filter may apply to the last step or to the expression as a whole, and it must only use downward selection from the context node (the self, child, attribute, descendant, descendant-or-self, or namespace axes). It must not be positional (that is, it must not reference position() or last(), and must not be numeric: in fact, it must be such that Saxon can determine at compile time that it will not be numeric). Filters cannot be applied to unions or to branches of unions. Any violation of these conditions causes the expression to be evaluated without the streaming optimization.

  • These rules apply after other optimization rewrites have been applied to the expression. For example, some FLWOR expressions may be rewritten to a path expression that satisfies these rules.

  • The optimization is enabled only if explicitly requested, either by using the saxon:stream() extension function, or the saxon:read-once attribute on an XSLT xsl:copy-of instruction, or the XQuery pragma saxon:stream. It is available only if the stylesheet or query is processed using Saxon-SA."

Note: this facility is most likely available only in the commercial version. I have used Saxon extensively in the past, and it is a nice piece of work.

Answered by Colin

Sorry, a slightly late answer here. It seems that this is possible for a subset of XPath; in general it's very difficult because XPath can match both forwards and backwards from the "current" point. I'm aware of two projects that solve it to some degree using state machines: http://spex.sourceforge.net and http://www.cs.umd.edu/projects/xsq. I haven't looked at them in detail, but they seem to use a similar approach.

Answered by Simone Gianni

XPath DOES work with SAX, and most XSLT processors (especially Saxon and Apache Xalan) support executing XPath expressions inside XSLTs on a SAX stream without building the entire DOM.

They manage to do this, very roughly, as follows:

  1. Examining the XPath expressions they need to match
  2. Receiving SAX events and testing if that node is needed or will be needed by one of the XPath expressions.
  3. Ignoring the SAX event if it is of no use for the XPath expressions.
  4. Buffering it if it's needed

How they buffer it is also very interesting: while some simply create DOM fragments here and there, others use highly optimized tables for quick lookup and reduced memory consumption.
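The four steps above can be sketched with a plain SAX handler. This is only an illustration of the general idea, not how Saxon or Xalan are actually implemented: the class name SaxPathMatcher is made up, the "XPath" is restricted to a single absolute path of child steps, and the buffer is a simple StringBuilder rather than an optimized table.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: ignore events the path does not need, buffer the ones it does.
class SaxPathMatcher extends DefaultHandler {
    private final String targetPath;                          // step 1: the path to match
    private final Deque<String> open = new ArrayDeque<>();    // open element names, root first
    private final List<String> matches = new ArrayList<>();
    private StringBuilder buffer;                              // non-null only inside a match

    SaxPathMatcher(String targetPath) {
        this.targetPath = targetPath;
    }

    private String currentPath() {
        return "/" + String.join("/", open);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);
        // step 2: test whether this node is needed by the expression
        if (buffer == null && currentPath().equals(targetPath)) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // step 3: text outside a match is dropped; step 4: text inside is buffered
        if (buffer != null) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (buffer != null && currentPath().equals(targetPath)) {
            matches.add(buffer.toString()); // the matched element is complete
            buffer = null;                  // release the buffer immediately
        }
        open.removeLast();
    }

    static List<String> match(String xml, String path) throws Exception {
        SaxPathMatcher handler = new SaxPathMatcher(path);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.matches;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(match("<a><b>x<i>y</i>z</b><c><b>no</b></c><b>w</b></a>", "/a/b"));
    }
}
```

At any moment the handler holds only the stack of open element names and the text of the element currently being matched, which is the memory profile the answer describes for downward-only queries.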

How much they manage to optimize largely depends on the kind of XPath queries they find. As the Saxon documentation already posted clearly explains, queries that move "up" and then traverse the document "horizontally" (sibling by sibling) obviously require the entire document to be available, but most queries require just a few nodes to be kept in RAM at any moment.

I'm pretty sure of this because, back when I was building web apps with Cocoon every day, we ran into the XSLT memory footprint problem each time we used a "//something" expression inside an XSLT, and quite often we had to rework XPath expressions to allow better SAX optimization.

Answered by Andreas Haufler

We regularly parse complex XML files of 1 GB and more using a SAX parser that extracts partial DOM trees, which can then be conveniently queried using XPath. I blogged about it here: http://softwareengineeringcorner.blogspot.com/2012/01/conveniently-processing-large-xml-files.html. Sources are available on GitHub under the MIT License.
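The following is a rough sketch of the general technique described here, written against the plain JDK APIs. It is not the code from the linked blog post or its GitHub project: the class name PartialDomDemo, the record-element convention, and the query method are all assumptions for illustration. The handler builds a small DOM only for each record element, runs an XPath expression against it, and then discards it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: SAX parsing that materializes one record at a time as a DOM.
class PartialDomDemo extends DefaultHandler {
    private final String recordName;  // element name that delimits one record
    private final String xpathExpr;   // evaluated relative to each record element
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private final List<String> results = new ArrayList<>();
    private Document doc;             // small DOM holding only the current record
    private Node current;             // insertion point in that DOM, null outside records

    PartialDomDemo(String recordName, String xpathExpr) {
        this.recordName = recordName;
        this.xpathExpr = xpathExpr;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (current == null) {
            if (!qName.equals(recordName)) {
                return; // outside any record: keep nothing
            }
            try {
                doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            } catch (ParserConfigurationException ex) {
                throw new SAXException(ex);
            }
            current = doc;
        }
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(element);
        current = element;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (current == null) {
            return;
        }
        Node parent = current.getParentNode();
        if (parent == doc) {
            // The record is complete: query the fragment, then drop it.
            try {
                results.add(xpath.evaluate(xpathExpr, current));
            } catch (XPathExpressionException ex) {
                throw new SAXException(ex);
            }
            doc = null;
            current = null;
        } else {
            current = parent;
        }
    }

    static List<String> query(String xml, String recordName, String xpathExpr) throws Exception {
        PartialDomDemo handler = new PartialDomDemo(recordName, xpathExpr);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.results;
    }
}
```

A call such as PartialDomDemo.query(xml, "order", "concat(@id, ':', total)") would evaluate the expression once per order element. Because each fragment is a real DOM, the full XPath language (including reverse axes) is available within a record, while memory use stays proportional to the largest record rather than the whole file.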
