How to parse a large (50 GB) XML file in Java

Question

Asked by Joe Maher

Currently I'm trying to use a SAX parser, but about 3/4 of the way through the file it freezes up completely. I have tried allocating more memory, but that has not improved anything.

Is there any way to speed this up? A better method?

I stripped it down to the bare bones, so I now have the following code, and when run from the command line it still isn't as fast as I would like.

Running it with "java -Xms-4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error around article 700000.

Main:

import java.util.ArrayList;

public class Read {
    public static void main(String[] args) {
        // Loads every parsed page into one in-memory list
        ArrayList<Page> pages = XMLManager.getPages();
    }
}

XMLManager

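import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {
    public static ArrayList<Page> getPages() {

        ArrayList<Page> pages = null;
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try {

            SAXParser parser = factory.newSAXParser();
            // Backslash must be escaped inside a Java string literal
            File file = new File("..\\enwiki-20140811-pages-articles.xml");
            PageHandler pageHandler = new PageHandler();

            parser.parse(file, pageHandler);
            pages = pageHandler.getPages();

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return pages;
    }
}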

PageHandler

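import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageHandler extends DefaultHandler {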

    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(){
        super();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {

        stringBuilder = new StringBuilder();

         if (qName.equals("page")){

            page = new Page();
            idSet = false;

        } else if (qName.equals("redirect")){
             if (page != null){
                 page.setRedirecting(true);
             }
        }
    }

     @Override
     public void endElement(String uri, String localName, String qName) throws SAXException {

         if (page != null && !page.isRedirecting()){

             if (qName.equals("title")){

                 page.setTitle(stringBuilder.toString());

             } else if (qName.equals("id")){

                 if (!idSet){

                     page.setId(Integer.parseInt(stringBuilder.toString()));
                     idSet = true;

                 }

             } else if (qName.equals("text")){

                 String articleText = stringBuilder.toString();

                 articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
                 articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
                 articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
                 articleText = articleText.replaceAll("\\|", " "); //Separate multiple links
                 articleText = articleText.replaceAll("\n", " "); //remove new lines
                 articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
                 articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space

                 Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                 Matcher matcher = pattern.matcher(articleText);
                 matcher.find();

                 try {
                     page.setSummaryText(matcher.group());
                 } catch (IllegalStateException se){
                     page.setSummaryText("None");
                 }
                 page.setText(articleText);

             } else if (qName.equals("page")){

                 pages.add(page);
                 page = null;

            }
        } else {
            page = null;
        }
     }

     @Override
     public void characters(char[] ch, int start, int length) throws SAXException {
         stringBuilder.append(ch,start, length); 
     }

     public ArrayList<Page> getPages() {
         return pages;
     }
}

Answer 1

Accepted answer by Don Roby

Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.

You need some sort of pipeline to pass the data on to its actual destination without ever storing it all in memory at once.

What I've sometimes done for this sort of situation is similar to the following.

Create an interface for processing a single element:

public interface PageProcessor {
    void process(Page page);
}

Supply an implementation of this to the PageHandler through a constructor:

public class Read  {
    public static void main(String[] args) {

        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing, 
                // but I don't know what that is...
                System.out.println(page);
           }
        }) ;
    }

}


public class XMLManager {

    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();

        try {

            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);

            parser.parse(file, pageHandler);

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

Send data to this processor instead of putting it in the list:

public class PageHandler extends DefaultHandler {

    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
         //Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
            //  Elide code not needing change

            } else if (qName.equals("page")){

                processor.process(page);
                page = null;

            }
        } else {
            page = null;
        }
    }

}
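
Since the handler above elides the unchanged parts, here is a compact, self-contained sketch of the same callback idea that can be run as-is. It is an illustration only, not the original code: StreamingSketch, TitleProcessor, TitleHandler and titlesOf are hypothetical names, the handler tracks only the title, and the demo collects results into a small list purely so there is something to print.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingSketch {

    /** Hypothetical callback, same shape as PageProcessor but for a bare title. */
    public interface TitleProcessor {
        void process(String title);
    }

    /** Hands each title to the processor as soon as its page element closes. */
    public static class TitleHandler extends DefaultHandler {
        private final TitleProcessor processor;
        private final StringBuilder text = new StringBuilder();
        private String title;

        public TitleHandler(TitleProcessor processor) {
            this.processor = processor;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            text.setLength(0); // reset the shared buffer for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title")) {
                title = text.toString();
            } else if (qName.equals("page")) {
                processor.process(title); // pass it on; nothing is kept in the handler
                title = null;
            }
        }
    }

    /** Parses the given XML and returns the titles in document order (demo only). */
    public static List<String> titlesOf(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new TitleHandler(seen::add));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>A</title></page>"
                   + "<page><title>B</title></page></mediawiki>";
        System.out.println(titlesOf(xml)); // prints [A, B]
    }
}
```

For the real 50 GB file the processor would write to a database, an index or an output file instead of a list, so memory use stays roughly constant per page.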

Of course, you can make your interface handle chunks of multiple records rather than just one, and have the PageHandler collect pages locally in a smaller list, periodically send the list off for processing, and then clear it.

Or (perhaps better) you could implement the PageProcessor interface as defined here and build in logic there that buffers the data and sends it on for further handling in chunks.
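
A minimal sketch of that buffering variant might look like the following. The names BatchingSketch, ItemProcessor and BatchingProcessor, the batch size and the sink callback are all assumptions made for illustration; the interface is generic only so the example compiles without the Page class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchingSketch {

    /** Same shape as PageProcessor, made generic so the sketch stands alone. */
    public interface ItemProcessor<T> {
        void process(T item);
    }

    /** Buffers incoming items and forwards them in lists of at most batchSize. */
    public static class BatchingProcessor<T> implements ItemProcessor<T> {
        private final int batchSize;
        private final Consumer<List<T>> sink; // hypothetical destination, e.g. a bulk insert
        private List<T> buffer = new ArrayList<>();

        public BatchingProcessor(int batchSize, Consumer<List<T>> sink) {
            this.batchSize = batchSize;
            this.sink = sink;
        }

        @Override
        public void process(T item) {
            buffer.add(item);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        /** Call once after parsing ends so the last partial batch is not lost. */
        public void flush() {
            if (!buffer.isEmpty()) {
                sink.accept(buffer);
                buffer = new ArrayList<>(); // start a fresh list; the sink owns the old one
            }
        }
    }

    public static void main(String[] args) {
        BatchingProcessor<String> processor =
                new BatchingProcessor<>(2, batch -> System.out.println(batch));
        for (String s : new String[] {"a", "b", "c"}) {
            processor.process(s);
        }
        processor.flush(); // prints [a, b] and then [c]
    }
}
```

Keeping the buffering in the processor leaves the handler unchanged; the one thing to remember is the final flush() after parser.parse(...) returns.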

Answer 2

Answered by dexter

Don Roby's approach is somewhat reminiscent of the approach I followed when creating a code generator designed to solve this particular problem (an early version was conceived in 2008). Basically, each complexType has its Java POJO equivalent, and handlers for the particular type are activated when the context changes to that element. I used this approach for SEPA, transaction banking and, for instance, discogs (30 GB). You can specify which elements you want to process at runtime, declaratively, using a properties file.

XML2J maps complexTypes to Java POJOs on the one hand, but also lets you specify the events you want to listen for. E.g.

account/@process = true
account/accounts/@process = true
account/accounts/@detach = true

The essence is in the third line. The detach makes sure individual accounts are not added to the accounts list. So it won't overflow.

class AccountType {
    private List<AccountType> accounts = new ArrayList<>();

    public void addAccount(AccountType tAccount) {
        accounts.add(tAccount);
    }
    // etc.
}

In your code you need to implement the process method (by default the code generator generates an empty method):

class AccountsProcessor implements MessageProcessor {
    static private Logger logger = LoggerFactory.getLogger(AccountsProcessor.class);

    // assuming Spring data persistency here
    final String path = new ClassPathResource("spring-config.xml").getPath();
    ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext(path);
    AccountsTypeRepo repo = context.getBean(AccountsTypeRepo.class);


    @Override
    public void process(XMLEvent evt, ComplexDataType data)
        throws ProcessorException {

        if (evt == XMLEvent.END) {
            if( data instanceof AccountType) {
                process((AccountType)data);
            }
        }
    }

    private void process(AccountType data) {
        if (logger.isInfoEnabled()) {
            // do some logging
        }
        repo.save(data);
    }
}

Note that XMLEvent.END marks the closing tag of an element, so by the time you process it the element is complete. If you have to relate it (using a foreign key) to its parent object in the database, you could process the XMLEvent.BEGIN for the parent, create a placeholder in the database, and store its key with each of its children. At the final XMLEvent.END you would then update the parent.

Note that the code generator generates everything you need. You just have to implement that method and of course the DB glue code.

There are samples to get you started. The code generator even generates your POM files, so you can build your project immediately after generation.

The default process method is like this:

@Override
public void process(XMLEvent evt, ComplexDataType data)
    throws ProcessorException {


    /*
     * TODO Auto-generated method stub: implement your own handling here.
     * Use the runtime configuration file to determine which events are to be sent to the processor.
     */

    if (evt == XMLEvent.END) {
        data.print( ConsoleWriter.out );
    }
}

Downloads:

First run mvn clean install on the core (it has to be in the local Maven repo), then on the generator. And don't forget to set the environment variable XML2J_HOME as per the directions in the user manual.
