Python Pytesseract OCR: multiple config options

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/44619077/

Date: 2020-08-20 00:14:27  Source: igfitidea

Pytesseract OCR multiple config options

python ocr tesseract

Asked by Niall Oswald

I am having some problems with pytesseract. I need to configure Tesseract so that it recognizes single digits and also restricts its output to digits only, because the number zero is often confused with the letter 'O'.

Like this:

target = pytesseract.image_to_string(im,config='-psm 7',config='outputbase digits')

Answered by thewaywewere

tesseract-4.0.0a supports the page segmentation modes (psm) listed below. For single-character recognition, set psm to 10. If your text consists of digits only, you can also set tessedit_char_whitelist=0123456789.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
                        bypassing hacks that are Tesseract-specific.

Here is a sample usage of image_to_string with multiple parameters. Note that all Tesseract options go into a single config string.

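# boxes=False matches the older pytesseract signature used in this answer;
# newer releases may not accept it, in which case it can be dropped.
target = pytesseract.image_to_string(image, lang='eng', boxes=False,
        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')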
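
The question passed config twice, which Python rejects as a repeated keyword argument. The sketch below shows one way to assemble several options into the single config string that image_to_string expects. The helper names build_tesseract_config and read_single_digit are hypothetical, not part of pytesseract, and the lang and boxes arguments are left out for brevity.

```python
def build_tesseract_config(psm, oem=None, whitelist=None):
    """Join individual Tesseract options into one config string."""
    parts = ["--psm {}".format(psm)]
    if oem is not None:
        parts.append("--oem {}".format(oem))
    if whitelist:
        # -c sets a Tesseract config variable on the command line
        parts.append("-c tessedit_char_whitelist={}".format(whitelist))
    return " ".join(parts)


def read_single_digit(image):
    """OCR one character from a PIL image, asking Tesseract for digits only."""
    # Imported here so the string helper above works without pytesseract installed.
    import pytesseract

    config = build_tesseract_config(psm=10, oem=3, whitelist="0123456789")
    return pytesseract.image_to_string(image, config=config).strip()
```

Calling build_tesseract_config(10, 3, "0123456789") yields the same string used in the answer above, so the two forms should behave identically.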

Hope this helps.

Answered by RALPH BURLESON

You are having trouble because character restriction (tessedit_char_whitelist) does not work in version 4.0. You have to force legacy mode (oem 0) for it to limit the characters it finds. This is a bug that the Tesseract team has not yet addressed.
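
As a rough sketch of this advice, the config string below forces the legacy engine with --oem 0 while keeping the digit whitelist. The keep_digits function is an extra safeguard that is not part of the original answer: it filters the OCR result in Python in case the whitelist is ignored. Both function names are hypothetical.

```python
DIGITS = "0123456789"


def legacy_digits_config(psm=10):
    """Config string that forces the legacy engine and whitelists digits."""
    return "--psm {} --oem 0 -c tessedit_char_whitelist={}".format(psm, DIGITS)


def keep_digits(text):
    """Drop every character that is not an ASCII digit."""
    return "".join(ch for ch in text if ch in DIGITS)


def read_digits_legacy(image, psm=10):
    """OCR a PIL image with the legacy engine, then filter the result."""
    # Imported here so the pure-Python helpers work without pytesseract installed.
    import pytesseract

    raw = pytesseract.image_to_string(image, config=legacy_digits_config(psm))
    return keep_digits(raw)
```

Filtering afterwards cannot turn a misread 'O' back into a zero; it only guarantees that the returned string contains digits and nothing else.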
