Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/5568806/

Time: 2020-10-30 11:42:18  Source: igfitidea

Transforming an xml to another xml with STaX takes a lot of time

java xml xslt stax

Asked by SpeTIX

I'm using the following code to transform a big xml stream to another stream:

 import java.io.ByteArrayInputStream;
 import java.io.InputStreamReader;
 import java.io.OutputStreamWriter;
 import java.io.PrintWriter;
 import java.io.Writer;
 import javax.xml.stream.XMLEventReader;
 import javax.xml.stream.XMLEventWriter;
 import javax.xml.stream.XMLInputFactory;
 import javax.xml.stream.XMLOutputFactory;
 import javax.xml.stream.XMLStreamException;
 import javax.xml.stream.XMLStreamReader;
 import javax.xml.stream.events.XMLEvent;
 import javax.xml.transform.Result;
 import javax.xml.transform.Source;
 import javax.xml.transform.Transformer;
 import javax.xml.transform.TransformerFactory;
 import javax.xml.transform.stax.StAXResult;
 import javax.xml.transform.stax.StAXSource;

 public class TryMe 
 {
   public static void main (final String[] args)
   {
    XMLInputFactory inputFactory = null;
    XMLEventReader eventReaderXSL = null;
    XMLEventReader eventReaderXML = null;
    XMLOutputFactory outputFactory = null;
    XMLEventWriter eventWriter = null;
    Source XSL = null;
    Source XML = null;
    inputFactory = XMLInputFactory.newInstance();
    outputFactory = XMLOutputFactory.newInstance();
    inputFactory.setProperty("javax.xml.stream.isSupportingExternalEntities", Boolean.TRUE);
    inputFactory.setProperty("javax.xml.stream.isNamespaceAware", Boolean.TRUE);
    inputFactory.setProperty("javax.xml.stream.isReplacingEntityReferences", Boolean.TRUE);
    try
    {
        eventReaderXSL = inputFactory.createXMLEventReader("my_template",
                new InputStreamReader(TryMe.class.getResourceAsStream("my_template.xsl")));
        eventReaderXML = inputFactory.createXMLEventReader("big_one", new InputStreamReader(
                TryMe.class.getResourceAsStream("big_one.xml")));
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }

    // get a TransformerFactory object
    final TransformerFactory transfFactory = TransformerFactory.newInstance();

    // define the Source object for the stylesheet
    try
    {
        XSL = new StAXSource(eventReaderXSL);
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }
    Transformer tran2 = null;
    // get a Transformer object
    try
    {

        tran2 = transfFactory.newTransformer(XSL);
    }
    catch (final javax.xml.transform.TransformerConfigurationException e)
    {
        System.out.println(e.getMessage());
    }

    // define the Source object for the XML document
    try
    {
        XML = new StAXSource(eventReaderXML);
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }

    // create an XMLEventWriter object
    try
    {

        eventWriter = outputFactory.createXMLEventWriter(new OutputStreamWriter(System.out));
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }

    // define the Result object
    final Result XML_r = new StAXResult(eventWriter);

    // call the transform method
    try
    {

        tran2.transform(XML, XML_r);
    }
    catch (final javax.xml.transform.TransformerException e)
    {
        System.out.println(e.getMessage());
    }

    // clean up
    try
    {
        eventReaderXSL.close();
        eventReaderXML.close();
        eventWriter.close();
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }
   }
 }

my_template is something like this:

<xsl:stylesheet version = '1.0' 
     xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

<xsl:preserve-space elements="*"/>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>


<xsl:template match="@k8[parent::point]">
  <xsl:attribute name="k8">
    <xsl:value-of select="'xxxxxxxxxxxxxx'"/>
  </xsl:attribute>
</xsl:template>

</xsl:stylesheet>

and the XML is a long, long list of:

<data>
  <point .... k8="blablabla" ... ></point>
  <point .... k8="blablabla" ... ></point>
  <point .... k8="blablabla" ... ></point>
  ....
  <point .... k8="blablabla" ... ></point>
</data>

If I use an identity transformer (transfFactory.newTransformer() instead of transfFactory.newTransformer(XSL)), the output is produced while the input stream is being processed. With my template, however, there's no way: the transformer reads all the input and only then starts to produce the output (and with a large stream, of course, an out-of-memory error very often comes before any result).

Any idea? I'm freaking out. I can't understand what's wrong with my code/XSLT.

Many thanks in advance!!

Answered by Martin Honnen

Well, XSLT 1.0 and 2.0 operate on a tree data model of the complete XML, so XSLT 1.0 and 2.0 processors usually read the complete XML input document into a tree and create a result tree that is then serialized. You seem to assume that using StAX changes the behaviour of XSLT, but I don't think that is the case: the XSLT processor builds the tree because the stylesheet could require complex XPath navigation, such as the preceding or preceding-sibling axes.

However, as you use Java, you could look into Saxon 9.3 and its experimental XSLT 3.0 streaming support; that way you should not run out of memory when processing very large XML input documents.

The part of your XSLT that is unusual is <xsl:template match="@k8[parent::point]">, which is usually written simply as <xsl:template match="point/@k8">, but you would need to test with your XSLT processor whether that changes performance.

Answered by AndyT

Using XSLT is probably not the best approach. As others have pointed out, your solution requires that the processor read the entire document into memory before writing out the output. You might wish to consider using a SAX parser to read each node sequentially, perform any transformation required (using a data-driven mapping if necessary), and write out the transformed data. This avoids having to create an entire document tree in memory and could enable significantly faster processing, as you're not attempting to build a complex document to write out.

Ask yourself if the output format is simple and stable, and then reconsider the use of XSLT. For large datasets of regular data, you might also wish to consider if XML is a good file format for transferring information.
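
To make the SAX suggestion concrete, here is a minimal sketch, not a definitive implementation. It assumes the only change needed is the one in the question's stylesheet (replace the k8 attribute on point elements). The class name K8SaxFilter and the String-based rewrite helper are hypothetical names made up for this illustration. The identity transformer is used only as a serializer here, which the question reports as emitting output while the input is still being read.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: passes every SAX event through unchanged,
// except that it overwrites the k8 attribute of <point> elements.
class K8SaxFilter extends XMLFilterImpl {
    private final String newValue;

    K8SaxFilter(XMLReader parent, String newValue) {
        super(parent);
        this.newValue = newValue;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("point".equals(localName)) {
            int index = atts.getIndex("k8");
            if (index >= 0) {
                // Copy the attributes so the parser's own object is not modified.
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(index, newValue);
                atts = copy;
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    // Convenience wrapper for small inputs; a large file would use
    // stream-based InputSource and StreamResult instead of Strings.
    static String rewrite(String xml, String newValue) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            // Identity transformer (no stylesheet): serializes the filtered events.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(new K8SaxFilter(parser, newValue),
                            new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Rewrite failed", e);
        }
    }
}
```

Calling K8SaxFilter.rewrite(xml, "xxxxxxxxxxxxxx") on a small document is enough to check the behaviour before pointing the same filter at the big file.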

Answered by matt b

The transformer reads all the input and then starts to produce the output (with a large stream, of course, an out-of-memory error very often comes before a result).

Any Idea?

If you are finding that it takes too long for this work to complete, then you need to redesign your approach to your task to avoid reading in the entire input file before you start to process the output file. There is nothing that can be tweaked with your code to make it magically faster - you need to address the core of your algorithm.

Answered by Lucas Bruand

As others have pointed out, using StAX won't change the way XSLT works: it reads everything before starting any work. If you need to work with very large files, you'll have to use something other than XSLT.

There are different options:

Answered by sudocode

How complex is the transformation you are doing with XSL? Could you make the same transformation using StAX alone?

With StAX it is quite easy to write a parser to match a particular node and then to insert, alter or remove nodes in the output stream you are writing to at that point. So instead of using XSL for the transform, you could maybe use StAX alone. This way you benefit from the streaming nature of the API (not buffering large tree in memory) and so there will be no memory issue.
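
As a rough sketch of what "StAX alone" could look like for the stylesheet in the question: the loop below copies every event from reader to writer and only rebuilds the start tag of point elements, swapping in a new k8 value. The class name StaxK8Rewriter and the String-based helper are hypothetical names chosen for this example, and the sketch assumes no other part of the document needs to change.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams events from reader to writer, one at a time.
class StaxK8Rewriter {

    static void rewrite(XMLEventReader reader, XMLEventWriter writer, String newValue)
            throws XMLStreamException {
        XMLEventFactory events = XMLEventFactory.newInstance();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && "point".equals(event.asStartElement().getName().getLocalPart())) {
                StartElement start = event.asStartElement();
                // Rebuild the attribute list, replacing only k8.
                List<Attribute> attributes = new ArrayList<>();
                Iterator<?> it = start.getAttributes();
                while (it.hasNext()) {
                    Attribute attribute = (Attribute) it.next();
                    attributes.add("k8".equals(attribute.getName().getLocalPart())
                            ? events.createAttribute(attribute.getName(), newValue)
                            : attribute);
                }
                event = events.createStartElement(
                        start.getName(), attributes.iterator(), start.getNamespaces());
            }
            // Each event is written as soon as it is read; no tree is built.
            writer.add(event);
        }
        writer.flush();
    }

    // Convenience wrapper for small inputs; a large file would pass
    // stream-backed readers and writers to the method above instead.
    static String rewrite(String xml, String newValue) {
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            rewrite(reader, writer, newValue);
            writer.close();
            reader.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Rewrite failed", e);
        }
    }
}
```

In the question's program, the eventReaderXML and eventWriter objects that are already created could be passed straight to the first rewrite method, with no Transformer involved.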

Coincidentally, this recent answer to another question might help you with that.

Answered by ThomasRS

Try Apache XSLTC for better performance; it uses code generation to simplify transforms.

Your XSLT transform looks really simple, and so does your input format; surely you can do the processing manually with StAX/SAX and gain a really good performance increase.