Parallel XML Parsing in Java

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/4208584/
Asked by Martin K.
I'm writing an application which processes a lot of XML files (>1000) with deep node structures. It takes about six seconds with Woodstox (event API) to parse a file with 22,000 nodes.
The algorithm is embedded in a process with user interaction where only a few seconds of response time are acceptable, so I need to improve my strategy for handling the XML files.
- My process analyses the XML files (extracting only a few nodes).
- The extracted nodes are processed and the new result is written into a new data stream (resulting in a copy of the document with modified nodes); see the sketch after this list.
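For illustration, that read-modify-copy step might look roughly like the following with the StAX event API (which the question says it uses via Woodstox). The element name "price" and the replacement value are invented for this example; the question does not show its actual code:

```java
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyWithModification {
    // Factories are created once and reused (see StaxMan's answer below).
    private static final XMLInputFactory IN = XMLInputFactory.newInstance();
    private static final XMLOutputFactory OUT = XMLOutputFactory.newInstance();
    private static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();

    // Copy the document event by event, replacing the text of <price> elements.
    public static void rewrite(InputStream source, OutputStream target) throws Exception {
        XMLEventReader reader = IN.createXMLEventReader(source);
        XMLEventWriter writer = OUT.createXMLEventWriter(target);
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                writer.add(event);
                if (event.isStartElement()
                        && "price".equals(event.asStartElement().getName().getLocalPart())
                        && reader.peek().isCharacters()) {
                    reader.nextEvent(); // drop the original text (a loop is needed if coalescing is off)
                    writer.add(EVENTS.createCharacters("42.00"));
                }
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}
```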
Now I'm thinking about a multithreaded solution (which scales better on 16+ core hardware). I have thought about the following strategies:
- Creating multiple parsers and running them in parallel on the XML sources.
- Rewriting my parsing algorithm to be thread-safe and use only one instance of the parser (factories, ...).
- Splitting the XML source into chunks and assigning the chunks to multiple processing threads (map-reduce on serial XML).
- Optimizing my algorithm (is there a better StAX parser than Woodstox?) / using a parser with built-in concurrency.
I want to improve both the overall performance and the "per file" performance.
Do you have experience with such problems? What is the best way to go?
Accepted answer by Peter Knego
This one is obvious: just create several parsers and run them in parallel in multiple threads.

Take a look at Woodstox Performance (down at the moment, try the Google cache).

This can be done IF the structure of your XML is predictable: if it has a lot of identical top-level elements. For instance:

```xml
<element>
  <more>more elements</more>
</element>
<element>
  <other>other elements</other>
</element>
```

In this case you could create a simple splitter that searches for `<element>` and feeds each part to a particular parser instance. That's a simplified approach: in real life I'd go with RandomAccessFile to find the start/stop points (`<element>`) and then create a custom FileInputStream that operates only on a part of the file.

Take a look at Aalto. It's from the same guys that created Woodstox. They are experts in this area - don't reinvent the wheel.
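A rough sketch of what such a splitter could look like is shown below. Everything here is illustrative: it assumes an ASCII-compatible encoding and that the marker string never occurs inside text content or CDATA, and a robust version would also handle a marker that straddles a buffer boundary (this one does not):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class ElementSplitter {
    // Scan the file for occurrences of a marker such as "<element>" and
    // return the byte offset of each hit. Consecutive offsets delimit the
    // regions that can then be handed to separate parser instances.
    public static List<Long> findStartOffsets(RandomAccessFile file, String marker)
            throws IOException {
        List<Long> offsets = new ArrayList<>();
        byte[] target = marker.getBytes("US-ASCII");
        byte[] buffer = new byte[8192];
        long base = 0;
        int read;
        file.seek(0);
        while ((read = file.read(buffer)) != -1) {
            // Naive scan; misses a marker split across two reads.
            for (int i = 0; i <= read - target.length; i++) {
                boolean hit = true;
                for (int j = 0; j < target.length; j++) {
                    if (buffer[i + j] != target[j]) { hit = false; break; }
                }
                if (hit) offsets.add(base + i);
            }
            base += read;
        }
        return offsets;
    }
}
```

Each region from one offset to the next then holds a single `<element>...</element>` fragment (plus surrounding whitespace), so a bounded stream over that region can be fed to its own parser thread.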
Answered by AlexR
I agree with Jim. I think that if you want to improve the performance of the overall processing of 1000 files, your plan is good, except for #3, which is irrelevant in this case. If, however, you want to improve the performance of parsing a single file, you have a problem. I do not know how it would be possible to split an XML file without parsing it. Each chunk would be illegal XML and your parser would fail.
I believe that improving the overall time is good enough for you. In that case, read this tutorial: http://download.oracle.com/javase/tutorial/essential/concurrency/index.html then create a thread pool of, for example, 100 threads and a queue that contains the XML sources. Each thread will then parse only about 10 files, which brings a serious performance benefit in a multi-CPU environment.
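A minimal sketch of that pooled approach, assuming a parseAndRewrite() placeholder for the per-file work (the pool here is sized to the core count rather than the 100 threads suggested above, in line with the per-core advice in the next answer):

```java
import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchParser {
    public static void processAll(List<File> xmlFiles) throws InterruptedException {
        // One task per file; the pool distributes the tasks across the cores.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (File file : xmlFiles) {
            pool.submit(() -> parseAndRewrite(file));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void parseAndRewrite(File file) {
        // Placeholder for the per-file StAX work described in the question.
    }
}
```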
Answered by StaxMan
In addition to the existing good suggestions there is one rather simple thing to do: use the cursor API (XMLStreamReader), NOT the event API. The event API adds 30-50% overhead without (IMO) making processing significantly easier. In fact, if you want convenience, I would recommend using StaxMate instead; it builds on top of the cursor API without adding significant overhead (at most 5-10% compared to hand-written code).
Now: I assume you have done the basic optimizations with Woodstox; but if not, check out "3 Simple Rules for Fast XML-processing using Stax". Specifically, you absolutely should:
- Make sure you create the XMLInputFactory and XMLOutputFactory instances only once.
- Close readers and writers to ensure buffer recycling (and other useful reuse) works as expected (both rules are illustrated in the sketch below).
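A short sketch of those two rules combined with the cursor API; the countElements() task is invented just to have something concrete to run:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;

public class CursorExample {
    // Rule 1: create the factory once and reuse it; factory lookup is costly.
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    public static int countElements(InputStream in, String localName) throws Exception {
        XMLStreamReader reader = FACTORY.createXMLStreamReader(in);
        try {
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    count++;
                }
            }
            return count;
        } finally {
            reader.close(); // Rule 2: closing lets Woodstox recycle its buffers.
        }
    }
}
```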
The reason I mention this is that while these make no functional difference (the code works as expected), they can make a big performance difference, even more so when processing smaller files.
Running multiple instances also makes sense, although usually with at most one thread per core. However, you will only benefit as long as your storage I/O can support such speeds; if the disk is the bottleneck this will not help and can in some cases hurt (if disk seeks compete). But it is worth a try.