How to parse large HTML files with the Java HTMLParser library

Question

Asked by Sergio del Amo

I have some HTML files created by a FileMaker export. Each file is basically one huge HTML table. I want to iterate through the table rows and insert them into a database. I have tried to do it with HTMLParser as follows:

// readFile is the asker's own helper; it loads the whole file into one String.
String inputHTML = readFile("filemakerExport.htm", "UTF-8");
Parser parser = new Parser();
parser.setInputHTML(inputHTML);
parser.setEncoding("UTF-8");
// parse(null) builds the complete node tree in memory before returning.
NodeList nl = parser.parse(null);
NodeList trs = nl.extractAllNodesThatMatch(new TagNameFilter("tr"), true);
for (int i = 0; i < trs.size(); i++) {
    NodeList nodes = trs.elementAt(i).getChildren();
    NodeList tds = nodes.extractAllNodesThatMatch(new TagNameFilter("td"), true);
    // Do stuff with tds
}

The above code works with files under 1 MB. Unfortunately, I have a 4.8 MB HTML file, and with it I get an out-of-memory error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.htmlparser.lexer.Lexer.parseTag(Lexer.java:1002)
    at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:369)
    at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:111)
    at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
    at org.htmlparser.Parser.parse(Parser.java:701)
    at Tools.main(Tools.java:33)

Is there a more efficient way to solve this problem with HTMLParser (I am totally new to the library), or should I use a different library or approach?

Answer 1

Answered by Kris

Have you tried increasing the maximum heap size of the JVM?

The following command-line argument raises it to 512 megabytes: -Xmx512M

E.g.

java -Xmx512M myrunclass
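
To confirm that the flag took effect, the JVM can report its own limit. This is a minimal sketch using only the standard library; the class name HeapCheck is made up for illustration and is not part of the original answer.

```java
public class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running java -Xmx512M HeapCheck should report a value in the neighbourhood of 512; the exact number depends on the JVM and the garbage collector in use.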

Answer 2

Answered by adrian.tarau

Don't build a DOM when you only want to extract some information and have no need for XPath queries or other kinds of queries that work best on a DOM structure (parent-child relations, etc.).

Use Parser.visitAllNodesWith() instead of Parser.parse().

Answer 3

Answered by mtomy

I've faced the same problem. HtmlParser seems to suffer from memory leaks and a lack of documentation. Profiling with JProfiler, I noticed that after parsing a page HtmlParser keeps a reference to the HTML it processed. I tried calling parser.reset() at the end of parsing, but it didn't help. I also looked at the test code but found no hints.

In the end, I dramatically decreased memory usage by calling parser.setInputHTML(""); once I no longer need the parser object.

P.S. It would be better to analyse HtmlParser's source code, but I don't have time for this :)

Answer 4

Answered by simbo1905

HTMLParser has both a parser and a lexer. The parser builds an in-memory model, whereas the lexer just notifies you of the tags in the file. For a simple extraction of fixed data, the lexer may be the most efficient way of extracting it, at the cost of having to track the structure of the HTML yourself as the tags are encountered. The HTMLParser library has not had a release for a while, so memory issues are worrying, as they seem unlikely to get fixed. Try JSoup if you need a high-level parse, as it has a powerful query syntax and is very easy to use.
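
The lexer-style approach amounts to reading the markup once and keeping only the current row in memory. The sketch below is not HTMLParser's Lexer API; it is a hand-rolled scanner built on the standard library only, meant to show the kind of bookkeeping involved. The class RowScanner and its methods are hypothetical. It assumes every cell and row has an explicit closing tag, that no '>' appears inside attribute values or comments, and that tables are not nested. Character entities are left undecoded.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

public class RowScanner {
    /** Streams the input once, handing each completed table row to onRow. */
    static void scan(Reader in, Consumer<List<String>> onRow) {
        StringBuilder tag = new StringBuilder();   // characters between '<' and '>'
        StringBuilder cell = new StringBuilder();  // text of the cell being read
        List<String> row = new ArrayList<>();
        boolean inTag = false;
        boolean inCell = false;
        try {
            int c;
            while ((c = in.read()) != -1) {
                char ch = (char) c;
                if (inTag) {
                    if (ch != '>') {
                        tag.append(ch);
                        continue;
                    }
                    inTag = false;
                    String name = tagName(tag);
                    tag.setLength(0);
                    switch (name) {
                        case "td":   // a new cell starts
                            inCell = true;
                            cell.setLength(0);
                            break;
                        case "/td":  // the cell is complete
                            if (inCell) {
                                row.add(cell.toString().trim());
                                inCell = false;
                            }
                            break;
                        case "/tr":  // the row is complete: emit it and start afresh
                            if (!row.isEmpty()) {
                                onRow.accept(row);
                                row = new ArrayList<>();
                            }
                            break;
                        default:     // every other tag is ignored
                            break;
                    }
                } else if (ch == '<') {
                    inTag = true;
                } else if (inCell) {
                    cell.append(ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lower-cased tag name, keeping a leading '/' and dropping attributes. */
    private static String tagName(CharSequence raw) {
        String s = raw.toString().trim().toLowerCase(Locale.ROOT);
        int end = 0;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) {
            end++;
        }
        return s.substring(0, end);
    }
}
```

Wrapping the file reader in a BufferedReader and doing the database insert inside the callback keeps memory use roughly proportional to one row rather than to the whole file.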
