How to parse a large HTML file with Java HTMLParser library
Original question: http://stackoverflow.com/questions/910438/
This page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow.
Asked by Sergio del Amo
I have some HTML files created by a FileMaker export. Each file is basically one huge HTML table. I want to iterate through the table rows and load them into a database. I have tried to do it with HTMLParser as follows:
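import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

// readFile is a helper that is not shown in the question
String inputHTML = readFile("filemakerExport.htm", "UTF-8");
Parser parser = new Parser();
parser.setInputHTML(inputHTML);
parser.setEncoding("UTF-8");
// parse(null) applies no filter and builds the complete node tree in memory
NodeList nl = parser.parse(null);
NodeList trs = nl.extractAllNodesThatMatch(new TagNameFilter("tr"), true);
for (int i = 0; i < trs.size(); i++) {
    NodeList nodes = trs.elementAt(i).getChildren();
    NodeList tds = nodes.extractAllNodesThatMatch(new TagNameFilter("td"), true);
    // Do stuff with tds
}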
The above code works with files under 1 MB. Unfortunately, I have a 4.8 MB HTML file, and with it I get an out-of-memory error.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.htmlparser.lexer.Lexer.parseTag(Lexer.java:1002)
at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:369)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:111)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
at org.htmlparser.Parser.parse(Parser.java:701)
at Tools.main(Tools.java:33)
Is there a more efficient way to solve this problem with HTMLParser (I am totally new to the library), or should I use a different library or approach?
Answered by Kris
Have you tried increasing the maximum heap size of the JVM?
The following command-line argument raises it to 512 megabytes: -Xmx512M
For example:
java -Xmx512M myrunclass
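To confirm that the flag actually took effect, the JVM can report the heap limit it is running with. The following is a minimal sketch; the class name HeapCheck is made up for illustration and is not part of the original answer. The reported value can be slightly below the requested size, depending on the JVM and garbage collector.

```java
// Hypothetical helper class (not from the original answer) that reports
// the heap limit the running JVM is actually using.
class HeapCheck {

    // Maximum amount of heap the JVM will attempt to use, in whole mebibytes.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it as java -Xmx512M HeapCheck and again without the flag shows whether the setting is being picked up.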
Answered by adrian.tarau
Don't build a DOM when you only want to extract some information and have no need for XPath queries or other kinds of queries that work best on a DOM structure (parent-child relations, etc.).
Use Parser.visitAllNodesWith() instead of Parser.parse().
Answered by mtomy
I've faced the same problem. It seems HtmlParser suffers from memory leaks and a lack of documentation. Profiling with JProfiler, I noticed that after parsing pages HtmlParser keeps a reference to the HTML it has processed. I tried calling parser.reset() at the end of parsing, but it didn't help. I also looked at the test code, but found no hints.
In the end, I dramatically decreased memory usage by calling parser.setInputHTML(""); once I no longer need the parser object.
P.S. It would be better to analyse HtmlParser's source code, but I don't have time for this :)
Answered by simbo1905
HTMLParser has both a parser and a lexer. The parser builds an in-memory model, whereas the lexer just notifies you of the tags in the file. For a simple extraction of fixed data, the lexer may be the most efficient way of extracting it, although you then have to track the structure of the HTML yourself as the tags are encountered. The HTMLParser library has not had a release for a while, so the memory issues are worrying, as they seem unlikely to get fixed. Try JSoup if you need a high-level parse, as it has a powerful query syntax and is very easy to use.
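The idea of tracking the structure yourself as tags go by can be illustrated without any library. The sketch below is not HTMLParser's Lexer; it is a hand-rolled, deliberately simplified tokenizer (the class RowStreamer and its methods are made up for illustration). It assumes well-formed, non-nested tables and does not handle comments, entities, or a '>' inside a quoted attribute value. Only the current row is kept in memory, so usage stays flat however large the file is.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical helper, not part of HTMLParser: streams table rows
// to a callback without ever building a tree of the document.
class RowStreamer {

    // Reads HTML from `in` and passes each completed row (its cell texts) to `onRow`.
    static void streamRows(Reader in, Consumer<List<String>> onRow) {
        List<String> row = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        StringBuilder tag = new StringBuilder();
        boolean inTag = false;   // currently between '<' and '>'
        boolean inCell = false;  // currently between <td> and </td>
        try {
            int c;
            while ((c = in.read()) != -1) {
                char ch = (char) c;
                if (inTag) {
                    if (ch != '>') {
                        tag.append(ch);
                        continue;
                    }
                    // A tag just ended: update the state we are tracking.
                    inTag = false;
                    String name = tagName(tag.toString());
                    tag.setLength(0);
                    if (name.equals("td")) {
                        inCell = true;
                        cell.setLength(0);
                    } else if (name.equals("/td") && inCell) {
                        inCell = false;
                        row.add(cell.toString().trim());
                    } else if (name.equals("/tr")) {
                        if (!row.isEmpty()) onRow.accept(row);
                        row = new ArrayList<>();
                    }
                } else if (ch == '<') {
                    inTag = true;
                } else if (inCell) {
                    cell.append(ch); // text inside a cell, including inside inline tags
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lower-cased tag name, e.g. "td" from "TD class=x" or "/tr" from "/TR".
    private static String tagName(String raw) {
        String s = raw.trim();
        int end = 0;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) end++;
        return s.substring(0, end).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>1</td><td>Alice</td></tr>"
                + "<TR><TD class=x> 2 </TD><td><b>Bob</b></td></TR></table>";
        streamRows(new StringReader(html), r -> System.out.println(r));
    }
}
```

For a real export, the StringReader would be replaced by a buffered reader over the file (for example one opened with Files.newBufferedReader), and the callback would insert each row into the database instead of printing it.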

