Java XML Parser for huge files

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/3969713/

Date: 2020-08-14 07:43:29  Source: igfitidea


Tags: java, xml, parsing

Asked by mehmet6parmak

I need an XML parser to parse a file that is approximately 1.8 GB,
so the parser should not load the whole file into memory.


Any suggestions?


Accepted answer by Tomas Narros

Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), which is included in the JDK (package javax.xml.stream).

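A minimal sketch of the cursor-style StAX API from javax.xml.stream: it pulls one event at a time from the stream, so memory use stays constant regardless of file size. The countItems helper and the <item> element name are illustrative assumptions, not part of the original question.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;

public class StaxDemo {
    // Counts <item> start tags without ever holding the whole document in memory:
    // the reader advances one event at a time over the underlying stream.
    static int countItems(Reader source) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(source);
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "item".equals(reader.getLocalName())) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        // For a 1.8 GB file you would pass a FileReader/InputStream instead.
        String xml = "<items><item>a</item><item>b</item></items>";
        System.out.println(countItems(new StringReader(xml))); // prints 2
    }
}
```

For a real 1.8 GB file you would hand createXMLStreamReader a buffered FileInputStream rather than a StringReader; the loop itself is unchanged.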

Answer by Nick Fortescue

Use almost any SAX parser to stream the file a bit at a time.


Answer by Nathan Hughes

Stream the file into a SAX parser and read it into memory in chunks.


SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on, and you have to pay attention to some things, like when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current xpath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.

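A small sketch of that content-handler idea, assuming a hypothetical document with <title> elements: a Deque tracks the current element path, and because characters() may be called several times for one text node, the text is accumulated in a buffer rather than read in a single call.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

public class SaxPathDemo {
    static class PathHandler extends DefaultHandler {
        final Deque<String> path = new ArrayDeque<>(); // current element path
        final StringBuilder text = new StringBuilder();
        String lastTitle;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            path.push(qName);
            text.setLength(0); // reset buffer for this element's text
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May arrive in several chunks for one text node; accumulate, don't assign.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("title".equals(qName)) {
                lastTitle = text.toString(); // full text is only complete here
            }
            path.pop();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book><title>SAX in Action</title></book>";
        PathHandler handler = new PathHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println(handler.lastTitle); // prints "SAX in Action"
    }
}
```

In a real chunked process you would inspect the path deque in endElement to decide when a complete chunk can be handed off.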

Answer by andrewmu

Use a SAX based parser that presents you with the contents of the document in a stream of events.


Answer by dogbane

Try VTD-XML. I've found it to be more performant and, more importantly, easier to use than SAX.


Answer by Will Hartung

As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly, store it someplace else (database, another file, what have you).


You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.


If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.


Answer by Eugene Kuleshov

The StAX API is easier to deal with compared to SAX. Here is a short tutorial.


Answer by Chris W

+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over the elements of the file until you're done), and as far as I know it has no limit on the size of the files it can process.


Answer by Adrian Smith

I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).


Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4GB and I had an 8GB machine, so I figured maybe 3GB of the file was just text, and java.lang.String would probably need 6GB for that text given its UTF-16 encoding.


If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.


So I decided to write all my strings out to disk in a file (the FS can obviously handle sequential writes of the 3GB just fine, and when reading it back the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.


StringsFile file = new StringsFile();
StringInFile str = file.newString("abc");        // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file