Open source OCR on Linux

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/5151798/

Date: 2020-08-05 03:04:17  Source: igfitidea

Open source OCR

Tags: java, ruby, linux, pdf, ocr

Asked by Chris

I'm looking for an open source OCR library that runs on Linux. It needs to work with PNGs and PDFs. Ideally I would like to call this library from Java or Ruby. Any idea whether something like this is available?

Regards.


Answered by Ben Jackson

Cuneiform is free and does a decent job. You could invoke it as a subprogram, but there's no language binding that I know of. It won't read PDFs directly, but PDFs that are sequences of scanned images are easy to take apart so the images can be fed to Cuneiform. There are also scripts to reassemble the images and text into a searchable PDF.
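Invoking Cuneiform as a subprogram from Java could look roughly like the sketch below. It assumes Java 11 or later and a `cuneiform` binary on the PATH that accepts the `-l`, `-f` and `-o` options of the Linux port; the class and method names (`CuneiformRunner`, `buildCommand`, `recognize`) are made up for illustration and are not part of any library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CuneiformRunner {

    // Builds the command line separately so it can be inspected without running anything.
    // Assumption: the installed cuneiform understands -l (language), -f (format), -o (output file).
    public static List<String> buildCommand(Path image, Path textOut, String language) {
        return List.of("cuneiform", "-l", language, "-f", "text",
                "-o", textOut.toString(), image.toString());
    }

    // Runs cuneiform on one image and returns the recognized text.
    public static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        Path textOut = Files.createTempFile("cuneiform-", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, textOut, language))
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // drop progress chatter
                    .redirectError(ProcessBuilder.Redirect.INHERIT)  // keep error messages visible
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("cuneiform exited with status " + exitCode);
            }
            return Files.readString(textOut);
        } finally {
            Files.deleteIfExists(textOut);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java CuneiformRunner <image>");
            return;
        }
        System.out.println(recognize(Path.of(args[0]), "eng"));
    }
}
```

Keeping the command construction in its own method makes it easy to check or log the exact command line before any process is started.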

Answered by olivierlemasle

Tesseract is a very good OCR engine: https://github.com/tesseract-ocr/tesseract


The project was started at HP Labs and is now continued and sponsored by Google (for Google Books!). It is released under the Apache license, and it runs on Linux. It reads TIFF or PNG files; for PDFs, you will need to convert to one of these formats first. I suppose there is no binding, so you would have to invoke it as a subprogram.
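If Tesseract is invoked as a subprogram, a Java wrapper might look like the following sketch. It relies on the command-line form `tesseract <image> <outputbase> -l <lang>`, which writes the text to `<outputbase>.txt`; it assumes Java 11 or later and a `tesseract` binary on the PATH. The names `TesseractRunner`, `buildCommand` and `recognize` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TesseractRunner {

    // tesseract appends ".txt" to the output base on its own.
    public static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs tesseract on one TIFF or PNG image and returns the recognized text.
    public static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("tesseract-");
        Path outputBase = workDir.resolve("result");
        Path textFile = workDir.resolve("result.txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with status " + exitCode);
            }
            return Files.readString(textFile);
        } finally {
            // Clean up the temporary result file and directory.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(workDir);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TesseractRunner <image>");
            return;
        }
        System.out.println(recognize(Path.of(args[0]), "eng"));
    }
}
```

The language code `eng` is only an example; it has to match a language data file installed alongside Tesseract.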

Answered by nguyenq

Try tesjeract, which uses JNI to call the Tesseract OCR API.

For PDFs, you'll need to convert them to images first, using Ghostscript, for instance.
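The PDF-to-image step can be sketched in Java by calling the Ghostscript `gs` executable, which renders each page to its own PNG. This assumes Java 11 or later and `gs` on the PATH; the `png16m` device and a resolution of 300 dpi are illustrative choices, and the class and method names (`PdfToPng`, `buildCommand`, `convert`) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PdfToPng {

    // %03d in -sOutputFile is expanded by Ghostscript to the page number (001, 002, ...).
    public static List<String> buildCommand(Path pdf, Path outputDir, int dpi) {
        return List.of("gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
                "-sDEVICE=png16m", "-r" + dpi,
                "-sOutputFile=" + outputDir.resolve("page-%03d.png"),
                pdf.toString());
    }

    // Renders every page of the PDF and returns the PNG files in page order.
    public static List<Path> convert(Path pdf, Path outputDir, int dpi)
            throws IOException, InterruptedException {
        Files.createDirectories(outputDir);
        Process process = new ProcessBuilder(buildCommand(pdf, outputDir, dpi))
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("gs exited with status " + exitCode);
        }
        try (Stream<Path> files = Files.list(outputDir)) {
            // Zero-padded names sort correctly for documents of up to 999 pages.
            return files
                    .filter(p -> p.getFileName().toString().matches("page-\\d+\\.png"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java PdfToPng <input.pdf> <output-dir>");
            return;
        }
        for (Path page : convert(Path.of(args[0]), Path.of(args[1]), 300)) {
            System.out.println(page);
        }
    }
}
```

Each resulting PNG can then be passed to the OCR engine one page at a time.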