Java performance: iText vs. PdfBox

Question

Asked by meilechh

I'm trying to convert a PDF (my favorite book, Effective Java, if it matters) to text, and I tried both iText and Apache PdfBox. I see a really big difference in performance: with iText it took 2:521, and with PdfBox 6:117. This is my code for PdfBox:

PDFTextStripper stripper = new PDFTextStripper();
BUFFER.append(stripper.getText(PDDocument.load(pdf)));

And this is for iText

PdfReader reader = new PdfReader(pdf);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
  BUFFER.append(PdfTextExtractor.getTextFromPage(reader, i));
}

My question is: what does the performance depend on? Is there a way to make PdfBox faster, or should I just use iText? And can you explain more about how strategies affect performance?

Answer 1

Accepted answer by mkl

My question is: what does the performance depend on? Is there a way to make PdfBox faster?

One major difference is that PDFBox always processes text glyph by glyph, while iText normally processes it chunk by chunk (a chunk being the single string parameter of a text drawing operation); this reduces the resources iText requires quite a lot. Furthermore, the event-oriented architecture of iText's text parsing puts a lower burden on resources than PDFBox's approach. And PDFBox keeps information that is not strictly required for plain text extraction available for a longer time, which costs more resources.
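
To make the granularity point concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions: the class GranularitySketch and its methods are hypothetical and use neither library; they merely count how many objects a parser would hand to its listener if it reported every character separately versus every drawn string as a whole.

```java
import java.util.ArrayList;
import java.util.List;

public class GranularitySketch {
    // Glyph-by-glyph model: one event object per character of every drawn string.
    static List<String> glyphEvents(List<String> drawnStrings) {
        List<String> events = new ArrayList<>();
        for (String s : drawnStrings) {
            for (int i = 0; i < s.length(); i++) {
                events.add(String.valueOf(s.charAt(i)));
            }
        }
        return events;
    }

    // Chunk-by-chunk model: one event object per drawn string.
    static List<String> chunkEvents(List<String> drawnStrings) {
        return new ArrayList<>(drawnStrings);
    }

    public static void main(String[] args) {
        // Hypothetical string operands of three text-drawing operations.
        List<String> drawn = List.of("Effective", " ", "Java");
        System.out.println(glyphEvents(drawn).size()); // prints 14
        System.out.println(chunkEvents(drawn).size()); // prints 3
    }
}
```

Over a whole book, the per-character variant allocates and tracks many more objects, which is the kind of overhead described above.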

But the way the libraries initially load the document may also make a difference. Here you can experiment a bit: PDFBox offers not only multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (actually, PDDocument.loadNonSeq reads documents correctly, while PDDocument.load can be tricked into misinterpreting PDFs). All these variants may have different runtime behavior.

more about how strategies affect performance?

iText comes with a simple and a more advanced text extraction strategy. The simple one assumes that text in the page content stream appears in reading order, while the more advanced one sorts it. By default the more advanced one is used, so you can probably speed up iText even more by using the simple strategy. PDFBox always sorts.
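
The difference between the two kinds of strategy can be sketched in plain Java. This is a toy model, not iText's implementation: the class StrategySketch, the Chunk type and both methods are hypothetical names made up for illustration. It shows why a strategy that trusts the stream order needs only one linear pass, while a sorting strategy has to buffer every chunk and order it by position first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StrategySketch {
    // Hypothetical model of one text-drawing operation: a string plus its start position.
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // "Simple" approach: trust the content stream order and append in a single pass.
    static String extractInStreamOrder(List<Chunk> chunks) {
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c.text);
        }
        return out.toString();
    }

    // "Sorting" approach: copy all chunks, order them top to bottom and then
    // left to right, and only then build the text.
    static String extractSortedByLocation(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upwards, so a larger y is higher on the page.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return extractInStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // The stream draws "world" before "Hello", although "Hello" is further left.
        List<Chunk> page = List.of(
                new Chunk("world", 100f, 700f),
                new Chunk("Hello", 50f, 700f),
                new Chunk("Second line", 50f, 680f));
        System.out.println(extractInStreamOrder(page));    // prints "world Hello Second line"
        System.out.println(extractSortedByLocation(page)); // prints "Hello world Second line"
    }
}
```

In iText 5, as far as I know, the simple strategy is selected by passing a strategy object as a third argument, i.e. PdfTextExtractor.getTextFromPage(reader, i, new SimpleTextExtractionStrategy()). As the toy output shows, the faster variant only gives correct text when the content stream is already in reading order.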

Answer 2

Answer by Bhaskara Arani

In PDFBox 2.0.12, they optimized PDFunctionType3.eval() by 30%, reduced the RAM requirement of COSOutputStream, and removed intermediate streams when merging files. All of this is documented in the release notes; see the link below for more information:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343489&styleName=Html&projectId=12310760&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED%7Cddb31610c9c60486ac6cc58a5800069ddf68ccd5%7Clout
