Java: Can JAXB parse large XML files in chunks?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1134189/

Time: 2020-08-12 00:08:15  Source: igfitidea

Can JAXB parse large XML files in chunks

java jaxb

Asked by John F.

I need to parse potentially large XML files, of which the schema is already provided to me in several XSD files, so XML binding is highly favored. I'd like to know if I can use JAXB to parse the file in chunks and if so, how.

Answered by skaffman

This is detailed in the user guide. The JAXB download from http://jaxb.java.net/ includes an example of how to parse one chunk at a time.

When a document is large, it's usually because there are repetitive parts in it. Perhaps it's a purchase order with a large list of line items, or perhaps it's an XML log file with a large number of log entries.

This kind of XML is suitable for chunk processing; the main idea is to use the StAX API, run a loop, and unmarshal individual chunks separately. Your program acts on a single chunk and then throws it away. This way, you keep at most one chunk in memory, which allows you to process large documents.

See the streaming-unmarshalling example and the partial-unmarshalling example in the JAXB RI distribution for more about how to do this. The streaming-unmarshalling example has the advantage that it can handle chunks at an arbitrary nesting level, but it requires you to deal with the push model: the JAXB unmarshaller will "push" each new chunk to you, and you'll need to process it right there.

In contrast, the partial-unmarshalling example works in a pull model (which usually makes the processing easier), but this approach has some limitations in data binding the portions of the document other than the repeated part.
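
As a rough illustration of the pull-model loop described above, the sketch below uses only the JDK's StAX API. The class name `ChunkLoop`, the method `forEachChunk`, and the `item` element name are assumptions made for this example, and it assumes each chunk contains text only. With JAXB, the marked line is where `unmarshaller.unmarshal(reader, YourClass.class)` would go instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ChunkLoop {
    /** Walks the document and hands the text of every element named chunkName to the sink, one at a time. */
    public static void forEachChunk(String xml, String chunkName, Consumer<String> sink)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && chunkName.equals(reader.getLocalName())) {
                    // With JAXB, unmarshaller.unmarshal(reader, YourClass.class) would bind
                    // the chunk here; getElementText() stands in for it in this sketch.
                    sink.accept(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<order><item>apple</item><item>pear</item><note>n/a</note></order>";
        List<String> seen = new ArrayList<>();
        forEachChunk(xml, "item", seen::add);
        System.out.println(seen);
    }
}
```

One difference to keep in mind when swapping in JAXB: `getElementText()` leaves the reader on the chunk's end tag, whereas `Unmarshaller.unmarshal(XMLStreamReader, Class)` leaves it on the event right after the end tag, so a real loop has to inspect the current event before calling `next()` again.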

Answered by yves amsellem

Because code matters, here is a PartialUnmarshaller that reads a big file in chunks. It can be used like this: new PartialUnmarshaller<YourClass>(stream, YourClass.class)

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.*;
import java.io.InputStream;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import static javax.xml.stream.XMLStreamConstants.*;

public class PartialUnmarshaller<T> {
    XMLStreamReader reader;
    Class<T> clazz;
    Unmarshaller unmarshaller;

    public PartialUnmarshaller(InputStream stream, Class<T> clazz) throws XMLStreamException, FactoryConfigurationError, JAXBException {
        this.clazz = clazz;
        this.unmarshaller = JAXBContext.newInstance(clazz).createUnmarshaller();
        this.reader = XMLInputFactory.newInstance().createXMLStreamReader(stream);

        /* ignore headers */
        skipElements(START_DOCUMENT, DTD);
        /* ignore root element */
        reader.nextTag();
        /* if there's no tag, ignore root element's end */
        skipElements(END_ELEMENT);
    }

    public T next() throws XMLStreamException, JAXBException {
        if (!hasNext())
            throw new NoSuchElementException();

        T value = unmarshaller.unmarshal(reader, clazz).getValue();

        skipElements(CHARACTERS, END_ELEMENT);
        return value;
    }

    public boolean hasNext() throws XMLStreamException {
        return reader.hasNext();
    }

    public void close() throws XMLStreamException {
        reader.close();
    }

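    /** Advances the reader past consecutive events whose type is listed in elements. */
    void skipElements(int... elements) throws XMLStreamException {
        int eventType = reader.getEventType();

        // Box the varargs so contains() can match event types;
        // Arrays.asList(int[]) would produce a List<int[]> instead.
        List<Integer> types = IntStream.of(elements).boxed().collect(Collectors.toList());
        while (types.contains(eventType))
            eventType = reader.next();
    }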
}

Answered by James Watkins

Yves Amsellem's answer is pretty good, but it only works if all elements are of exactly the same type. Otherwise the unmarshal call will throw an exception, and because the reader will have already consumed the bytes, you would be unable to recover. Instead, we should follow Skaffman's advice and look at the sample from the JAXB jar.

To explain how it works:

  1. Create a JAXB unmarshaller.
  2. Add a listener to the unmarshaller for intercepting the appropriate elements. This is done by "hacking" the ArrayList to ensure the elements are not stored in memory after being unmarshalled.
  3. Create a SAX parser. This is where the streaming happens.
  4. Use the unmarshaller to generate a handler for the SAX parser.
  5. Stream!
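
The push model behind steps 3 to 5 can be sketched with the JDK's SAX parser alone. In this hypothetical `PushDemo`, a plain `DefaultHandler` stands in for the handler that `unmarshaller.getUnmarshallerHandler()` would supply, and the `entry` element name is an assumption. The point is only that the parser calls you back for each element as it streams through the input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class PushDemo {
    /** Receives parser events and records the text of each <entry> element. */
    static class EntryHandler extends DefaultHandler {
        // Kept in a list only so the demo can show a result; a real handler
        // would process each entry and let it go.
        final List<String> entries = new ArrayList<>();
        private StringBuilder text; // non-null only while inside an <entry>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("entry".equals(localName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("entry".equals(localName)) {
                entries.add(text.toString()); // the chunk is complete at this point
                text = null;
            }
        }
    }

    public static List<String> parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();
        EntryHandler handler = new EntryHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.entries;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<log><entry>started</entry><entry>stopped</entry></log>"));
    }
}
```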

I modified the solution to be generic*. However, it required some reflection. If this is not OK, please look at the code samples in the JAXB jars.

ArrayListAddInterceptor.java

import java.lang.reflect.Field;
import java.util.ArrayList;

public class ArrayListAddInterceptor<T> extends ArrayList<T> {
    private static final long serialVersionUID = 1L;

    private AddInterceptor<T> interceptor;

    public ArrayListAddInterceptor(AddInterceptor<T> interceptor) {
        this.interceptor = interceptor;
    }

    @Override
    public boolean add(T t) {
        interceptor.intercept(t);
        return false;
    }

    public static interface AddInterceptor<T> {
        public void intercept(T t);
    }

    public static void apply(AddInterceptor<?> interceptor, Object o, String property) {
        try {
            Field field = o.getClass().getDeclaredField(property);
            field.setAccessible(true);
            field.set(o, new ArrayListAddInterceptor(interceptor));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

}
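
To see the interception trick in isolation, here is a minimal, JAXB-free sketch. The names `InterceptDemo`, `Orders`, `ForwardingList`, and `install` are hypothetical and exist only for this example; `Orders` stands in for a generated container class whose list field the unmarshaller would normally fill.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class InterceptDemo {
    /** Hypothetical stand-in for a JAXB-generated container with a repeated child. */
    public static class Orders {
        private List<String> order = new ArrayList<>();

        public List<String> getOrder() {
            return order;
        }
    }

    /** A list that forwards every added element to a callback and stores nothing. */
    static class ForwardingList<T> extends ArrayList<T> {
        private static final long serialVersionUID = 1L;
        private final transient Consumer<T> callback;

        ForwardingList(Consumer<T> callback) {
            this.callback = callback;
        }

        @Override
        public boolean add(T t) {
            callback.accept(t);
            return false; // the element was handed off, not stored
        }
    }

    /** Replaces the named list field of target with a ForwardingList, via reflection. */
    public static <T> void install(Object target, String fieldName, Consumer<T> callback)
            throws ReflectiveOperationException {
        Field field = target.getClass().getDeclaredField(fieldName);
        field.setAccessible(true);
        field.set(target, new ForwardingList<>(callback));
    }

    public static void main(String[] args) throws Exception {
        Orders orders = new Orders();
        List<String> handled = new ArrayList<>();
        InterceptDemo.<String>install(orders, "order", handled::add);

        // Whoever fills the list (the unmarshaller, in the real setup) only calls add().
        orders.getOrder().add("first");
        orders.getOrder().add("second");

        System.out.println(handled + " stored=" + orders.getOrder().size());
    }
}
```

Every element reaches the callback while the container's own list stays empty, which is what keeps memory use flat when the unmarshaller adds thousands of children.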

Main.java

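import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Assumption: PurchaseOrder and PurchaseOrders are the classes generated
// from primer.xsd into the "primer" package.
import primer.PurchaseOrder;
import primer.PurchaseOrders;

public class Main {
    public void parsePurchaseOrders(ArrayListAddInterceptor.AddInterceptor<PurchaseOrder> interceptor, List<File> files) {
        try {
            // create JAXBContext for the primer.xsd
            JAXBContext context = JAXBContext.newInstance("primer");

            Unmarshaller unmarshaller = context.createUnmarshaller();

            // install the callback on all PurchaseOrders instances
            unmarshaller.setListener(new Unmarshaller.Listener() {
                @Override
                public void beforeUnmarshal(Object target, Object parent) {
                    if (target instanceof PurchaseOrders) {
                        ArrayListAddInterceptor.apply(interceptor, target, "purchaseOrder");
                    }
                }
            });

            // create a new XML parser
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(unmarshaller.getUnmarshallerHandler());

            for (File file : files) {
                // close each stream once its file has been parsed
                try (InputStream in = new FileInputStream(file)) {
                    reader.parse(new InputSource(in));
                }
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}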

*This code has not been tested and is for illustrative purposes only.
