Parsing very large XML documents (and a bit more) in Java

Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow.

Original question: http://stackoverflow.com/questions/355909/
Asked by Chris R
(All of the following is to be written in Java)
I have to build an application that will take as input XML documents that are, potentially, very large. The document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- and will be processed in three phases:
First, the stream will be decrypted according to the aforementioned algorithm.
Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.
Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in some part overlap the portion of the document dealt with by the second operation, i.e., I believe I will need to rewind whatever mechanism I am using to deal with this object.
Here is my question:
Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure if it's possible to parse XML in the way I'm describing: by walking over as much of the document as is required to gather the second step's information, and then by rewinding the document and passing over it again to split it into jobs, ideally releasing all of the parts of the document that are no longer in use after they have been passed.
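For the decryption step, a stream filter could look something like the sketch below. The Decryptor type is a hypothetical stand-in for the client's preexisting algorithm; a real cipher may need internal buffering if it works on blocks rather than single bytes.

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Placeholder for the client's preexisting algorithm.
    interface Decryptor {
        int decryptByte(int cipherByte);
        void decrypt(byte[] buf, int off, int len); // decrypt in place
    }

    // Decrypts bytes on the fly as downstream code reads them.
    class DecryptingInputStream extends FilterInputStream {
        private final Decryptor decryptor;

        DecryptingInputStream(InputStream in, Decryptor decryptor) {
            super(in);
            this.decryptor = decryptor;
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            return b < 0 ? b : decryptor.decryptByte(b);
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                decryptor.decrypt(buf, off, n);
            }
            return n;
        }
    }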
Answered by Joachim Sauer
You could use a BufferedInputStream with a very large buffer size and use mark() before the extension class works and reset() afterwards.
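A minimal sketch of that idea, assuming everything the extension class reads fits within the buffer (the file name and buffer size are placeholders):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    public class MarkResetExample {
        public static void main(String[] args) throws Exception {
            int bufferSize = 64 * 1024 * 1024; // must cover everything read before reset()
            try (InputStream in = new BufferedInputStream(
                    new FileInputStream("large.xml"), bufferSize)) {
                in.mark(bufferSize);   // remember the current position
                // ... let the extension class read from 'in' here ...
                in.reset();            // rewind for the next pass
            }
        }
    }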
If the parts the extension class needs are very far into the file, though, this might become extremely memory intensive.
A more general solution would be to write your own BufferedInputStream workalike that buffers to the disk if the data that is to be buffered exceeds some preset threshold.
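A simplified sketch of that direction: instead of spilling to disk only above a threshold, this version copies the whole stream to a temporary file once, after which each pass simply re-opens the file (the class name is made up for illustration).

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Buffers an entire stream on disk so it can be "rewound" any number
    // of times by re-opening it; a real workalike would keep small inputs
    // in memory and spill to disk only past a preset threshold.
    public final class DiskBufferedSource implements AutoCloseable {
        private final Path tempFile;

        public DiskBufferedSource(InputStream source) throws IOException {
            tempFile = Files.createTempFile("xmlbuf", ".tmp");
            Files.copy(source, tempFile, StandardCopyOption.REPLACE_EXISTING);
        }

        public InputStream openStream() throws IOException {
            return new BufferedInputStream(Files.newInputStream(tempFile));
        }

        @Override
        public void close() throws IOException {
            Files.deleteIfExists(tempFile);
        }
    }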
Answered by PhiLho
You might be interested in XOM:
XOM is fairly unique in that it is a dual streaming/tree-based API. Individual nodes in the tree can be processed while the document is still being built. This enables XOM programs to operate almost as fast as the underlying parser can supply data. You don't need to wait for the document to be completely parsed before you can start working with it.
XOM is very memory efficient. If you read an entire document into memory, XOM uses as little memory as possible. More importantly, XOM allows you to filter documents as they're built so you don't have to build the parts of the tree you aren't interested in. For instance, you can skip building text nodes that only represent boundary white space, if such white space is not significant in your application. You can even process a document piece by piece and throw away each piece when you're done with it. XOM has been used to process documents that are gigabytes in size.
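The filter-while-building idea looks roughly like this with XOM's NodeFactory; the record element name and the processing step are placeholders:

    import java.io.File;
    import nu.xom.Builder;
    import nu.xom.Element;
    import nu.xom.NodeFactory;
    import nu.xom.Nodes;

    // Processes each <record> element as soon as it is finished, then
    // returns an empty Nodes so the subtree is never attached to the
    // document and can be garbage collected.
    public class RecordStreamer extends NodeFactory {
        private final Nodes empty = new Nodes();

        @Override
        public Nodes finishMakingElement(Element element) {
            if ("record".equals(element.getLocalName())) { // assumes <record> is never the root
                process(element);
                return empty; // discard the subtree instead of attaching it
            }
            return super.finishMakingElement(element);
        }

        private void process(Element record) {
            System.out.println(record.toXML()); // placeholder processing
        }

        public static void main(String[] args) throws Exception {
            new Builder(new RecordStreamer()).build(new File("large.xml"));
        }
    }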
Answered by Guillaume
This sounds like a job for StAX (JSR 173). StAX is a pull parser, which means that it works more or less like an event-based parser such as SAX, but gives you more control over when to stop reading, which elements to pull, and so on.
The usability of this solution will depend a lot on what your extension classes are actually doing, if you have control over their implementation, etc...
The main point is that if the document is very large, you probably want to use an event-based parser and not a tree-based one, so you will not use a lot of memory.
Implementations of StAX can be found from Sun (SJSXP), Codehaus, or a few other providers.
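A sketch of pulling a single piece of information and then stopping, so nothing past that point is ever parsed (the file and element names are placeholders):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxSkim {
        public static void main(String[] args) throws Exception {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("large.xml"));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "metadata".equals(reader.getLocalName())) {
                        System.out.println(reader.getElementText());
                        break; // stop pulling; the rest of the file is never parsed
                    }
                }
            } finally {
                reader.close();
            }
        }
    }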
Answered by Nick Holt
I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.
    SAXParserFactory.newInstance().newSAXParser().parse(
        new DecryptingInputStream(),
        new MyHandler()
    );
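Here DecryptingInputStream would be a filter like the one sketched under the question above, and MyHandler could be as simple as the following sketch (the element name is a placeholder):

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Minimal SAX handler sketch: react to each element as the parser
    // streams past it; nothing is retained in memory.
    class MyHandler extends DefaultHandler {
        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if ("record".equals(qName)) {
                // handle one record's worth of data, then let it go
            }
        }
    }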
Answered by NickV
Look at the XOM library. The example you are looking for is StreamingExampleExtractor.java in the samples directory of the source distribution. This shows a technique for performing a streaming parse of a large XML document, only building specific nodes, processing them, and discarding them. It is very similar to a SAX approach, but has a lot more parsing capability built in, so a streaming parse can be achieved pretty easily.
If you want to work at a higher level, look at NUX. This provides a high-level streaming XPath API that reads into memory only the amount of data needed to evaluate the XPath.