What is the most efficient Java-based streaming XSLT processor?

Question

Asked by Vihung

I have a very large XML file which I need to transform into another XML file, and I would like to do this with XSLT. I am more interested in optimisation for memory, rather than optimisation for speed (though, speed would be good too!).

Which Java-based XSLT processor would you recommend for this task?

Would you recommend any other way of doing it (non-XSLT?, non-Java?), and if so, why?

The XML files in question are very large but not very deep: they have millions of rows (elements), but are only about 3 levels deep.

Answer 1

Answered by Dimitre Novatchev

At present there are only three XSLT 2.0 processors known, and of them Saxon 9.x is probably the most efficient (at least in my experience) in both speed and memory utilisation. Saxon-SA (the schema-aware version of Saxon, which is not free, unlike the B (basic) version) has special extensions for streamed processing.

Among the various existing XSLT 1.0 processors, .NET XslCompiledTransform (C#-based, not Java!) seems to be the champion.

In the Java-based world of XSLT 1.0 processors, Saxon 6.x is again pretty good.

UPDATE:

Now, more than 3 years after this question was originally answered, there isn't any evidence that the efficiency differences between the XSLT processors mentioned have changed.

As for streaming:

An XML document with "millions of nodes" may well be processed even without any streaming. I conducted an experiment in which Saxon 9.1.0.7 processed an XML document that contains around one million third-level elements with integer values. The transformation simply calculates their sum. The total time for the transformation on my computer was less than 1.5 seconds. The memory used was about 500 MB, an amount that PCs could have had even 10 years ago.

Here are Saxon's informational messages that show details about the transformation:

Saxon 9.1.0.7J from Saxonica
Java version 1.6.0_17
Stylesheet compilation time: 190 milliseconds
Processing file:/C:\temp\delete\MRowst.xml
Building tree for file:/C:\temp\delete\MRowst.xml using class net.sf.saxon.tinytree.TinyBuilder
Tree built in 1053 milliseconds
Tree size: 3075004 nodes, 1800000 characters, 0 attributes
Loading net.sf.saxon.event.MessageEmitter
Execution time: 1448 milliseconds
Memory used: 506661648
NamePool contents: 14 entries in 14 chains. 6 prefixes, 6 URIs
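
A similar in-memory measurement can be tried with a minimal sketch like the one below. It is not the stylesheet or driver used for the figures above. The element names `table`, `row`, and `value`, the class name `SumTransform`, and the stylesheet itself are assumptions made for illustration. The code uses only the standard JAXP API, so it runs with the XSLT 1.0 processor bundled with the JDK; a different processor placed on the classpath is typically picked up through the same API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SumTransform {
    // Hypothetical stylesheet: sums every third-level <value> element.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:value-of select='sum(/table/row/value)'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Builds the whole input tree in memory (no streaming) and returns the sum as text.
    static String sum(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<table>"
            + "<row><value>1</value></row>"
            + "<row><value>2</value></row>"
            + "<row><value>3</value></row>"
            + "</table>";
        System.out.println(sum(xml));
    }
}
```

For a real measurement the input would be read from a file with `new StreamSource(new File(...))` rather than from a string, and the JVM heap limit (`-Xmx`) would need to be large enough to hold the full tree.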

Saxon 9.4 has a saxon:stream() extension function that can be used for processing huge XML documents.

Here is an excerpt from the documentation:

There are basically two ways of doing streaming in Saxon:
Burst-mode streaming: with this approach, the transformation of a large file is broken up into a sequence of transformations of small pieces of the file. Each piece in turn is read from the input, turned into a small tree in memory, transformed, and written to the output file.
This approach works well for files that are fairly flat in structure, for example a log file holding millions of log records, where the processing of each log record is independent of the ones that went before.
A variant of this technique uses the new XSLT 3.0 xsl:iterate instruction to iterate over the records, in place of xsl:for-each. This allows working data to be maintained as the records are processed: this makes it possible, for example, to output totals or averages at the end of the run, or to make the processing of one record dependent on what came before it in the file. The xsl:iterate instruction also allows early exit from the loop, which makes it possible for a transformation to process data from the beginning of a large file without actually reading the whole file.
Burst-mode streaming is available in both XSLT and XQuery, but there is no equivalent in XQuery to the xsl:iterate construct.
Streaming templates: this approach follows the traditional XSLT processing pattern of performing a recursive descent of the input XML hierarchy by matching template rules to the nodes at each level, but does so one element at a time, without building the tree in memory.
Every template belongs to a mode (perhaps the default, unnamed mode), and streaming is a property of the mode that can be specified using the new xsl:mode declaration. If the mode is declared to be streamable, then every template rule within that mode must obey the rules for streamable processing.
The rules for what is allowed in streamed processing are quite complicated, but the essential principle is that the template rule for a given node can only read the descendants of that node once, in order. There are further rules imposed by limitations in the current Saxon implementation: for example, although grouping using <xsl:for-each-group group-adjacent="xxx"> is theoretically consistent with a streamed implementation, it is not currently implemented in Saxon.
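
The idea behind record-at-a-time processing with running totals and early exit can be illustrated without Saxon at all. The sketch below is not Saxon's streaming API and not XSLT; it uses the JDK's StAX pull parser to show the same principle: each record is read once, in order, only a running total is kept in memory, and the loop can stop before the end of the file. The element name `value`, the class name `RunningTotal`, and the `maxRecords` parameter are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class RunningTotal {
    // Sums the integer content of <value> elements, reading the document
    // strictly forward and stopping after maxRecords values (early exit).
    static long sumValues(String xml, int maxRecords) {
        long total = 0;
        int seen = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext() && seen < maxRecords) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "value".equals(reader.getLocalName())) {
                    // getElementText() consumes the text and leaves the
                    // reader on the matching end tag.
                    total += Long.parseLong(reader.getElementText().trim());
                    seen++;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return total;
    }

    public static void main(String[] args) {
        String xml = "<table>"
            + "<row><value>1</value></row>"
            + "<row><value>2</value></row>"
            + "<row><value>3</value></row>"
            + "</table>";
        System.out.println(sumValues(xml, Integer.MAX_VALUE)); // all records
        System.out.println(sumValues(xml, 2));                 // stop after two
    }
}
```

Memory use here does not grow with the number of rows, which is the property the streaming approaches above aim for; the trade-off is that anything needing a second look at earlier records has to be carried along explicitly.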

XSLT 3.0 is expected to include a standard streaming feature. However, the W3C document still has "working draft" status, and the streaming specification is likely to change in subsequent draft versions. Because of this, no implementations of the current draft streaming specification exist.
Warning: Not every transformation can be performed in streaming mode -- regardless of the XSLT processor. One example of a transformation that isn't possible to perform in a streaming mode (with a limited amount of RAM) for huge documents is sorting their elements (say by a common attribute).

Answer 2

Answered by Stephen Denne

You could consider STX, whose Java implementation is Joost. It is similar to XSLT, but because it is a stream processor it can process enormous files using very little RAM.

Joost can be used as a standard javax.xml.transform.TransformerFactory.
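
As a rough sketch of what that looks like in code, JAXP lets a specific factory implementation be requested by class name. The class name `net.sf.joost.trax.TransformerFactoryImpl` used below is an assumption and should be checked against the Joost documentation; `FactoryChooser` and its fallback behaviour are hypothetical and exist only so that the example still runs when the named implementation is not on the classpath.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

class FactoryChooser {
    // Assumed class name of Joost's JAXP factory; verify before relying on it.
    static final String JOOST_FACTORY = "net.sf.joost.trax.TransformerFactoryImpl";

    // Returns the named factory if it can be loaded, otherwise the
    // platform default factory.
    static TransformerFactory create(String factoryClassName) {
        try {
            return TransformerFactory.newInstance(factoryClassName, null);
        } catch (TransformerFactoryConfigurationError e) {
            return TransformerFactory.newInstance();
        }
    }

    public static void main(String[] args) {
        TransformerFactory factory = create(JOOST_FACTORY);
        System.out.println("Using " + factory.getClass().getName());
    }
}
```

The rest of the application can then call `newTransformer(...)` and `transform(...)` as with any other JAXP processor, which keeps the choice of engine in one place.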

Answer 3

Answered by Peter Štibraný

See Saxon support for streaming mode. http://www.saxonica.com/html/documentation/sourcedocs/streaming/

If this streaming mode isn't for you, you can try Saxon's tiny tree mode, which is optimized for lower memory usage. (It is the default anyway.)
