Notice: This page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me), citing the original source: http://stackoverflow.com/questions/778145/

PDF Text Extraction Approach Using OCR

Tags: java, pdf, text-parsing

Asked by Jon

Has anybody attempted to extract text from a PDF using an OCR library and Java? What did you find to be the most reliable library for text extraction? Most of the approaches I've seen (tesseract, GOCR) are C libraries that would require some JNI code to be written.

I'm familiar with pdfbox, which is now an Apache incubator project at version 0.8.x, but its text extraction isn't always accurate. I'm looking for an alternative approach that is somewhat more reliable.

I haven't fully tried Asprise JavaPDF yet (I'm in the process of trying it), but I wanted to know more about the OCR approach, if it's possible.

Any help would be appreciated.

Answered by Sam Barnum

If you have a text-based PDF, I'd strongly recommend PDFTextStream. It's not free, but the licensing is reasonable, and it is much better than PDFBox. PDFBox chokes on many PDF files generated by newer tools, and is not very consistent even on the PDFs it can handle. PDFTextStream handles any PDF I throw at it, including PDFs with embedded PNG images, which PDFBox cannot do.

If you heckle the PDFTextStream folks to add OCR, they may listen up.

Answered by Andrew

We use ABBYY FineReader Engine 11. It has a Java wrapper.

Pros:

  • It works great with all the languages (English, Russian, Uzbek, etc.) and does real OCR: even if you have a PDF without an OCR text layer, it renders the pages first and then runs OCR on them.

Cons:

  • It costs money. You have to buy a developer license and an end-user license.

  • It is EXTREMELY slow.

Answered by Otávio Décio

If you want to run OCR on a text-based PDF, you may have to convert it to an image first.

Answered by nguyenq

You can use a Java wrapper for Tesseract (tesjeract or Tess4J) to perform OCR. However, for a PDF, you'll need to convert it to an image (PNG or TIFF) first before feeding it to the OCR engine.

VietOCR calls the Tesseract executable to perform the text extraction. It uses GhostScript to do the PDF-to-image conversion.
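
For illustration, here is a minimal sketch of the same general idea (render the PDF to images with GhostScript, then run the Tesseract executable on each image) using only the Java standard library. This is not VietOCR's actual code. The class and method names are hypothetical, and it assumes that `gs` and `tesseract` are installed and on the PATH (on Windows the GhostScript executable is typically named `gswin64c`), that 300 DPI is a suitable resolution, and that the `eng` language data is installed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class: shells out to GhostScript and Tesseract.
final class PdfOcrSketch {

    // Command that renders every page of the PDF to a 24-bit PNG in outDir.
    // GhostScript expands %03d to the page number (page-001.png, ...).
    static List<String> ghostscriptCommand(Path pdf, Path outDir, int dpi) {
        return Arrays.asList(
                "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
                "-sDEVICE=png16m",
                "-r" + dpi,
                "-sOutputFile=" + outDir.resolve("page-%03d.png"),
                pdf.toString());
    }

    // Command that OCRs one image; Tesseract writes its result to outputBase + ".txt".
    static List<String> tesseractCommand(Path image, Path outputBase, String lang) {
        return Arrays.asList(
                "tesseract", image.toString(), outputBase.toString(), "-l", lang);
    }

    // Run an external command and fail if it exits with a non-zero status.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed (" + exitCode + "): " + command);
        }
    }

    // Render the PDF, OCR each page in order, and concatenate the recognized text.
    static String extractText(Path pdf, String lang) throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("pdf-ocr");
        run(ghostscriptCommand(pdf, workDir, 300)); // 300 DPI is an assumed default

        List<Path> pages;
        try (Stream<Path> files = Files.list(workDir)) {
            pages = files
                    .filter(f -> f.getFileName().toString().endsWith(".png"))
                    .sorted() // zero-padded names keep pages in order
                    .collect(Collectors.toList());
        }

        StringBuilder text = new StringBuilder();
        for (Path page : pages) {
            String name = page.getFileName().toString();
            Path outputBase = workDir.resolve(name.substring(0, name.length() - ".png".length()));
            run(tesseractCommand(page, outputBase, lang));
            byte[] bytes = Files.readAllBytes(Paths.get(outputBase + ".txt"));
            text.append(new String(bytes, StandardCharsets.UTF_8));
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PdfOcrSketch input.pdf
        System.out.println(extractText(Paths.get(args[0]), "eng"));
    }
}
```

Calling the executables through `ProcessBuilder` avoids writing any JNI code, at the cost of starting a new process for every page. A wrapper such as Tess4J would keep the OCR step in-process instead.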