Java performance: iText vs. PdfBox
Disclaimer: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/22340674/
Performance iText vs. PdfBox
Asked by meilechh
I'm trying to convert a PDF (my favorite book, Effective Java, if it matters) to text, and I tried both iText and Apache PdfBox. I see a really big difference in performance: with iText it took 2:521, and with PdfBox 6:117. This is my code for PdfBox:
PDFTextStripper stripper = new PDFTextStripper();
BUFFER.append(stripper.getText(PDDocument.load(pdf)));
And this is for iText:
PdfReader reader = new PdfReader(pdf);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    BUFFER.append(PdfTextExtractor.getTextFromPage(reader, i));
}
My question is: what does the performance depend on? Is there a way to make PdfBox faster, or should I just use iText? And can you explain more about how strategies affect performance?
Accepted answer by mkl
My question is: what does the performance depend on? Is there a way to make PdfBox faster?
One major difference is that PDFBox always processes text glyph by glyph, while iText normally processes it chunk by chunk (a chunk being the single string parameter of a text drawing operation); that reduces the resources iText requires quite a lot. Furthermore, the event-oriented architecture of iText's text parsing puts a lower burden on resources than PDFBox's. And PDFBox keeps information that is not strictly required for plain text extraction available for a longer time, which costs more resources.
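To make the glyph-versus-chunk point concrete, here is a toy sketch. It is not PDFBox or iText code: the class GranularitySketch and both of its methods are hypothetical, and a glyph is approximated by one Unicode code point. It only counts how many processing events each granularity would produce for the same string operands.

```java
import java.util.List;

// Hypothetical model, not part of either library.
class GranularitySketch {

    // Chunk-wise processing: one event per string operand of a text drawing operation.
    static int eventsPerChunk(List<String> operands) {
        return operands.size();
    }

    // Glyph-wise processing: one event per glyph (approximated by a code point here).
    static int eventsPerGlyph(List<String> operands) {
        int events = 0;
        for (String operand : operands) {
            events += operand.codePointCount(0, operand.length());
        }
        return events;
    }
}
```

For the two operands "Effective " and "Java", this model yields 2 chunk events against 14 glyph events, which hints at why per-glyph bookkeeping can cost noticeably more on a whole book.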
But the way the libraries initially load the document may also make a difference. Here you can experiment a bit: PDFBox not only offers multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (actually, PDDocument.loadNonSeq reads documents correctly while PDDocument.load can be tricked into misinterpreting PDFs). All these different variants may have different runtime behavior.
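One way to run that experiment is a small timing helper like the sketch below. It uses only the JDK; the class LoadTimingSketch and the method timeEach are hypothetical names, and the loading variants are passed in as callables, so nothing here depends on a particular PDFBox version.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical helper: runs each named variant once and records the elapsed time.
class LoadTimingSketch {

    static Map<String, Long> timeEach(Map<String, Callable<?>> variants) {
        Map<String, Long> elapsedNanos = new LinkedHashMap<>();
        for (Map.Entry<String, Callable<?>> variant : variants.entrySet()) {
            long start = System.nanoTime();
            try {
                variant.getValue().call();
            } catch (Exception e) {
                // Surface failures unchecked so callers need no throws clause.
                throw new IllegalStateException("Variant failed: " + variant.getKey(), e);
            }
            elapsedNanos.put(variant.getKey(), System.nanoTime() - start);
        }
        return elapsedNanos;
    }
}
```

With PDFBox 1.8.x on the classpath (an assumption, since the question does not name a version), the variants could be something like () -> PDDocument.load(pdf) and () -> PDDocument.loadNonSeq(pdf, null), each closing the returned document afterwards. A single run per variant is noisy, so repeating the measurement a few times gives a fairer comparison.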
more about how strategies affect performance?
iText brings along a simple and a more advanced text extraction strategy. The simple one assumes that text in the page content stream appears in reading order, while the more advanced one sorts it. By default the more advanced one is used. Thus, you can probably speed up iText even more by using the simple strategy. PDFBox always sorts.
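The difference between the two strategies can be illustrated with a toy model. The sketch below is not iText code: StrategySketch, Chunk, and both methods are hypothetical, and the sort order (top to bottom, then left to right, with y growing upwards as in PDF page coordinates) is a simplifying assumption about what "sorting" means.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of text chunks with the position at which they are drawn.
class StrategySketch {

    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // "Simple" approach: trust the order of the content stream, no extra work.
    static String extractInStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk chunk : chunks) {
            sb.append(chunk.text);
        }
        return sb.toString();
    }

    // "Sorting" approach: copy and sort by position first, which costs time and memory.
    static String extractSorted(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingDouble((Chunk c) -> -c.y)   // larger y is higher on the page
                .thenComparingDouble(c -> c.x));      // then left to right
        return extractInStreamOrder(sorted);
    }
}
```

In iText 5.x the choice is made by passing a strategy object, for example PdfTextExtractor.getTextFromPage(reader, i, new SimpleTextExtractionStrategy()); the simple variant is only safe when the PDF already stores its text in reading order.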
Answer by Bhaskara Arani
In PDFBox version 2.0.12, PDFunctionType3.eval() was optimized by 30%, the RAM requirement of COSOutputStream was reduced, and intermediate streams were removed when merging files. All of this is documented in the release notes. Please see the link below for more information: