Python: use pytesseract to recognize text from an image

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original authors (not me), StackOverflow. Original question: http://stackoverflow.com/questions/37745519/

Date: 2020-08-19 19:53:06  Source: igfitidea

use pytesseract to recognize text from image

Tags: python, image, ocr, pytesser

Asked by Smith John

I need to use pytesseract to extract text from this picture: [image: https://i.stack.imgur.com/HWLay.gif]


and the code:


from PIL import Image, ImageEnhance, ImageFilter
import pytesseract
path = 'pic.gif'
img = Image.open(path)
img = img.convert('RGBA')
pix = img.load()
# If any channel of a pixel is dark (< 102), make it black; otherwise make it white
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pix[x, y][0] < 102 or pix[x, y][1] < 102 or pix[x, y][2] < 102:
            pix[x, y] = (0, 0, 0, 255)
        else:
            pix[x, y] = (255, 255, 255, 255)
img.convert('RGB').save('temp.jpg')  # JPEG has no alpha channel, so drop it before saving
text = pytesseract.image_to_string(Image.open('temp.jpg'))
# os.remove('temp.jpg')
print(text)

and the "temp.jpg" is enter image description here


Not bad, but the printed result is ",2 WW" rather than the right text "2HHH", so how can I remove those black dots?


Answered by Smith John

Here is my solution:


import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open("temp.jpg") # the second one 
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
text = pytesseract.image_to_string(Image.open('temp2.jpg'))
print(text)

Answered by Dinesh Chandra Kumawat

I have a somewhat different pytesseract approach for our community. Here is my approach:


import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("temp.jpg"), lang='eng',
                        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

print(text)
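
Here --psm 10 tells Tesseract to treat the image as a single character and --oem 3 selects the default OCR engine mode, while tessedit_char_whitelist restricts the output to the digits 0-9. This configuration therefore suits images containing a single digit; for a longer string such as 2HHH you would pick a different page segmentation mode and whitelist.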

Answered by nathancy

To perform OCR on an image, it's important to preprocess the image. Here's a simple approach using OpenCV and Pytesseract OCR. The idea is to obtain a processed image where the text to extract is in black and the background is in white. To do this, we can convert to grayscale, apply a slight Gaussian blur, then Otsu's threshold to obtain a binary image. From here, we can apply morphological operations to remove noise. Finally we invert the image. We perform text extraction using the --psm 6 configuration option to assume a single uniform block of text. Take a look here for more options.




Here's a visualization of each step:


Input image


[image]

Convert to grayscale -> Gaussian blur -> Otsu's threshold


[image]

Notice how there are tiny specks of noise; to remove them we can perform morphological operations.


[image]

Finally we invert the image


[image]

Result from Pytesseract OCR


2HHH

Code


import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening

# Perform text extraction
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('opening', opening)
cv2.imshow('invert', invert)
cv2.waitKey()
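
The pytesseract.pytesseract.tesseract_cmd line is only needed when the Tesseract executable is not on your PATH, as is typical for a default Windows install; on most Linux or macOS setups it can be omitted.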

Answered by SIM

To extract the text directly from the web, you can try the following implementation (making use of the first image):


import io
import requests
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance

response = requests.get('https://i.stack.imgur.com/HWLay.gif')
img = Image.open(io.BytesIO(response.content))
img = img.convert('L')
img = img.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(img)
img = enhancer.enhance(2)
img = img.convert('1')
img.save('image.jpg')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
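
Note that image.jpg is saved only so you can inspect the preprocessed result; image_to_string is called on the in-memory img object, so the save step can be dropped.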

Answered by nishit chittora

Here is my small improvement: removing noise and arbitrary lines within a certain colour range.


import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open(img)  # img is the path of the image
im = im.convert("RGBA")
newimdata = []
datas = im.getdata()

# Keep dark pixels (likely text) and turn everything else white
for item in datas:
    if item[0] < 112 or item[1] < 112 or item[2] < 112:
        newimdata.append(item)
    else:
        newimdata.append((255, 255, 255, 255))  # RGBA image, so use a 4-tuple
im.putdata(newimdata)

im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
# Note: older Tesseract 3.x expects '-psm 6' instead of '--psm 6'
text = pytesseract.image_to_string(Image.open('temp2.jpg'),
                                   config='-c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyz --psm 6',
                                   lang='eng')
print(text)

Answered by nexoma

You only need to increase the size of the picture with cv2.resize:


image = cv2.resize(image,(0,0),fx=7,fy=7)

my picture 200x40 -> HZUBS


the same picture resized to 1400x300 -> A 1234 (so this is right)


and then,


retval, image = cv2.threshold(image,200,255, cv2.THRESH_BINARY)
image = cv2.GaussianBlur(image,(11,11),0)
image = cv2.medianBlur(image,9)

and adjust the parameters to improve the results; the available page segmentation modes are listed below, and a combined sketch follows the list.


Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
            bypassing hacks that are Tesseract-specific.
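
Putting the resize, threshold, and blur steps of this answer together, here is a minimal end-to-end sketch. The input file name 'small.png', the 7x scale factor, and the --psm 7 (single text line) mode are assumptions to adjust for your own image:

import cv2
import pytesseract

# 'small.png' is a placeholder path for a low-resolution input image
image = cv2.imread('small.png', cv2.IMREAD_GRAYSCALE)

# Enlarge the image so the characters are big enough for Tesseract
image = cv2.resize(image, (0, 0), fx=7, fy=7, interpolation=cv2.INTER_CUBIC)

# Binarize, then smooth away the artifacts introduced by resizing
retval, image = cv2.threshold(image, 200, 255, cv2.THRESH_BINARY)
image = cv2.GaussianBlur(image, (11, 11), 0)
image = cv2.medianBlur(image, 9)

# --psm 7: treat the image as a single text line (see the modes above)
text = pytesseract.image_to_string(image, config='--psm 7')
print(text)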