Parsing large XML documents in Java

Notice: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/15132390/

Time: 2020-10-31 18:35:37  Source: igfitidea

Tags: java, xml, sqlite, xml-parsing

Asked by cgval

I have the following problem:

I have an XML file (approx. 1 GB) and have to move up and down through it (i.e. not sequentially, one element after the other) to get the required data and do some operations on it. Initially I used the Java DOM API, but while parsing the XML file the JVM reached its maximum heap size and halted.

To overcome this problem, one solution I came up with is to find another parser that iterates over each element in the XML, and to store each element's contents in a temporary SQLite database on my hard disk. This way the JVM's heap is not exceeded, and once all the data has been loaded I can ignore the XML file and continue my operations on the temporary SQLite database.

Is there another way to tackle this problem?

Answer by hthserhs

SAX (Simple API for XML) will help you here.

Unlike the DOM parser, the SAX parser does not create an in-memory representation of the XML document, so it is faster and uses less memory. Instead, the SAX parser informs clients of the XML document structure by invoking callbacks, that is, by invoking methods on an org.xml.sax.helpers.DefaultHandler instance provided to the parser.

Here is an example implementation:

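import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
DefaultHandler handler = new MyHandler();
parser.parse("file.xml", handler);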

In MyHandler, you define the actions to take when events such as the start or end of a document or element are generated.

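import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class MyHandler extends DefaultHandler {

    // Called once, before any other event.
    @Override
    public void startDocument() throws SAXException {
    }

    // Called once, after the whole document has been read.
    @Override
    public void endDocument() throws SAXException {
    }

    // Called for every opening tag; attributes holds that tag's attributes.
    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
    }

    // Called for every closing tag.
    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
    }

    // To take specific actions for each chunk of character data (such as
    // adding the data to a node or buffer, or printing it to a file).
    // The parser may deliver one text node in several chunks.
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
    }

}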
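
To make the callback flow concrete, here is a minimal sketch of a handler that actually does something: it counts how often each element name occurs and how many characters of text it sees. The class name ElementCounter and the helper parse(String) are made up for this illustration and are not part of the original answer; for a large file you would pass a File or InputStream to SAXParser.parse instead of a string.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example handler: keeps only small running totals in memory,
// never the document itself.
class ElementCounter extends DefaultHandler {
    private final Map<String, Integer> counts = new TreeMap<>();
    private long textChars = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        // One more occurrence of this element name.
        counts.merge(qName, 1, Integer::sum);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the length is recorded; the text itself is discarded.
        textChars += length;
    }

    Map<String, Integer> getCounts() {
        return counts;
    }

    long getTextChars() {
        return textChars;
    }

    // Illustration-only helper: parses XML held in a string.
    static ElementCounter parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ElementCounter handler = new ElementCounter();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        ElementCounter h = parse("<library><book>A</book><book>B</book></library>");
        System.out.println(h.getCounts()); // prints {book=2, library=1}
    }
}
```

Because the handler holds only the totals, its memory use does not grow with the size of the input, which is the property that matters for a 1 GB file.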

Answer by gaborsch

If you don't want to be bound by memory limits, I recommend keeping your current approach and storing everything in a database.

The parsing of the XML file should be done by a SAX parser, as everybody has recommended (including me). This way you can create one object at a time, and you can immediately persist it into the database.
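
The "one object at a time" idea could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the XML layout (item elements with an id attribute and a name child), the Item class, and the Consumer used as the sink. In a real setup the sink would typically run a JDBC insert for each object rather than collect into a list; the list in parseToList is there only so the sketch runs without a database.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical record type; the real fields depend on the XML being parsed.
final class Item {
    final String id;
    final String name;

    Item(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Builds one Item per <item> element and hands it to the sink immediately.
class ItemHandler extends DefaultHandler {
    private final Consumer<Item> sink;
    private final StringBuilder text = new StringBuilder();
    private String currentId;
    private String currentName;

    ItemHandler(Consumer<Item> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        text.setLength(0); // start collecting text for the new element
        if ("item".equals(qName)) {
            currentId = attributes.getValue("id");
            currentName = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            currentName = text.toString().trim();
        } else if ("item".equals(qName)) {
            // The object is complete: pass it on (e.g. to a database insert).
            sink.accept(new Item(currentId, currentName));
        }
    }

    // Illustration-only helper: collects the items in a list instead of a database.
    static List<Item> parseToList(String xml) throws Exception {
        List<Item> out = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new ItemHandler(out::add));
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items><item id=\"1\"><name>Foo</name></item></items>";
        for (Item item : parseToList(xml)) {
            System.out.println(item.id + " " + item.name);
        }
    }
}
```

Passing the sink in as a parameter keeps the parsing code independent of the storage choice, so the same handler works whether the objects end up in SQLite, MySQL, or an ORM.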

For the post-processing (resolving cross-references), you can run SELECTs against the database, create primary keys, indexes, etc. You can also use an ORM (EclipseLink, Hibernate) if you feel comfortable with that.

Actually, I don't really recommend SQLite; it's easier to set up a MySQL server and store the data there. Later you can even reuse the XML data (if you don't delete it).

Answer by Michael Kay

If you want to use a higher-level approach than SAX, which can be very tricky to program, you could look at streaming XSLT transformations using a recent Saxon-EE release. However, you've been too vague about the precise processing you are doing for anyone to know whether this will work in your particular case.

Answer by dexter

If you need a resource-friendly approach to handling very large XML, try this: http://www.xml2java.net/xml-to-java-data-binding-for-big-data/ It allows you to process data in a SAX-like way, but with the advantage of getting high-level events (XML data mapped onto Java objects) and being able to work with these objects directly in your code. So it combines JAXB convenience with SAX resource-friendliness.
