Python: using pytesseract to recognize text from an image
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37745519/
Use pytesseract to recognize text from an image
Asked by Smith John
I need to use pytesseract to extract text from this picture:
Here is the code:
from PIL import Image, ImageEnhance, ImageFilter
import pytesseract

path = 'pic.gif'
img = Image.open(path)
img = img.convert('RGBA')
pix = img.load()
# Binarize: any pixel with a dark channel becomes black, everything else white
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pix[x, y][0] < 102 or pix[x, y][1] < 102 or pix[x, y][2] < 102:
            pix[x, y] = (0, 0, 0, 255)
        else:
            pix[x, y] = (255, 255, 255, 255)
# JPEG has no alpha channel, so drop it before saving
# (recent Pillow versions refuse to write RGBA as JPEG)
img.convert('RGB').save('temp.jpg')
text = pytesseract.image_to_string(Image.open('temp.jpg'))
# os.remove('temp.jpg')
print(text)
Not bad, but the printed result is ",2 WW", not the correct text "2HHH". So how can I remove those black dots?
Answered by Smith John
Here is my solution:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
im = Image.open("temp.jpg") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
text = pytesseract.image_to_string(Image.open('temp2.jpg'))
print(text)
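The same steps can also be run in memory, without writing an intermediate JPEG (JPEG is lossy, so saving a black-and-white image to it can reintroduce small artifacts). This is only a sketch: the helper name `preprocess` and its `contrast` parameter are made up for illustration, and the grayscale conversion up front is borrowed from the web-based answer further down, not from the solution above.

```python
from PIL import Image, ImageEnhance, ImageFilter


def preprocess(img, contrast=2.0):
    """Median-filter, boost contrast and binarize a PIL image (hypothetical helper)."""
    gray = img.convert("L")                               # work on a single channel
    gray = gray.filter(ImageFilter.MedianFilter(size=3))  # removes isolated dots
    gray = ImageEnhance.Contrast(gray).enhance(contrast)  # push pixels towards black/white
    return gray.convert("1")                              # 1-bit black and white


# Usage (needs pytesseract and a Tesseract install, so it is left commented out):
# import pytesseract
# text = pytesseract.image_to_string(preprocess(Image.open("temp.jpg")))
# print(text)
```

The 3x3 median only removes specks that are a pixel or so wide; larger dots would need a bigger `size`, at the cost of eroding thin strokes in the text.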
Answered by Dinesh Chandra Kumawat
I have a somewhat different pytesseract approach to share with the community. Here is my approach:
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("temp.jpg"), lang='eng',
                                   config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)
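The config string above packs three separate settings into one argument: the page segmentation mode, the OCR engine mode, and a character whitelist. A small helper can make each one explicit. This is just a sketch; `build_config` is a hypothetical name, not part of pytesseract, and the 0 to 13 range check is taken from the page segmentation mode list at the end of this page.

```python
def build_config(psm=None, oem=None, whitelist=None):
    """Assemble a Tesseract config string for pytesseract (hypothetical helper)."""
    parts = []
    if psm is not None:
        if not 0 <= psm <= 13:
            raise ValueError("psm must be between 0 and 13")
        parts.append(f"--psm {psm}")
    if oem is not None:
        parts.append(f"--oem {oem}")
    if whitelist:
        # Restrict recognition to the given characters
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Rebuilds the string used in the answer above
config = build_config(psm=10, oem=3, whitelist="0123456789")
print(config)
```

Note that mode 10 treats the image as a single character and the whitelist allows digits only, so for text with several characters or letters you would probably want different values.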
Answered by nathancy
To perform OCR on an image, it's important to preprocess the image. Here's a simple approach using OpenCV and Pytesseract OCR. The idea is to obtain a processed image where the text to extract is in black with the background in white. To do this, we can convert to grayscale, apply a slight Gaussian blur, then Otsu's threshold to obtain a binary image. From here, we can apply morphological operations to remove noise. Finally we invert the image. We perform text extraction using the --psm 6 configuration option to assume a single uniform block of text. Take a look here for more options.
Here's a visualization of each step:
Input image
Convert to grayscale -> Gaussian blur -> Otsu's threshold
Notice how there are tiny specks of noise; to remove them we can perform morphological operations
Finally we invert the image
Result from Pytesseract OCR
2HHH
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening
# Perform text extraction
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imshow('opening', opening)
cv2.imshow('invert', invert)
cv2.waitKey()
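For anyone wondering what Otsu's threshold actually does: it picks the grayscale cut-off that best separates the pixels into a dark group and a bright group, by maximizing the variance between the two groups. Below is a rough pure-Python sketch of that idea on a flat list of 0..255 values. It is not OpenCV's implementation, and `otsu_threshold` is a name made up for this illustration.

```python
def otsu_threshold(pixels):
    """Return the threshold t maximizing between-class variance (pixels <= t are 'dark')."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0     # number of pixels in the dark class so far
    sum0 = 0   # sum of intensities in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue          # no dark pixels yet
        if w1 == 0:
            break             # no bright pixels left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy "image": half dark pixels, half bright pixels
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
binary = [0 if p <= t else 255 for p in pixels]
```

In the code above, cv2.threshold with cv2.THRESH_OTSU chooses the cut-off automatically, which is why the threshold argument is passed as 0.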
Answered by SIM
To extract the text directly from the web, you can try the following implementation (making use of the first image):
import io
import requests
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance
response = requests.get('https://i.stack.imgur.com/HWLay.gif')
img = Image.open(io.BytesIO(response.content))
img = img.convert('L')
img = img.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(img)
img = enhancer.enhance(2)
img = img.convert('1')
img.save('image.jpg')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
Answered by nishit chittora
Here is my small improvement, which removes noise and arbitrary lines within a certain colour range.
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open(img)  # img is the path of the image
im = im.convert("RGBA")
newimdata = []
datas = im.getdata()
# Keep pixels that have a dark channel, turn everything else white
for item in datas:
    if item[0] < 112 or item[1] < 112 or item[2] < 112:
        newimdata.append(item)
    else:
        newimdata.append((255, 255, 255, 255))  # RGBA needs four values
im.putdata(newimdata)
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
# Tesseract 4+ expects --psm (older 3.x releases accepted -psm)
text = pytesseract.image_to_string(Image.open('temp2.jpg'), config='-c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyz --psm 6', lang='eng')
print(text)
Answered by nexoma
You only need to increase the size of the picture with cv2.resize:
image = cv2.resize(image,(0,0),fx=7,fy=7)
My picture at 200x40 -> HZUBS
The same picture resized to 1400x300 -> A 1234 (so, this is right)
and then,
retval, image = cv2.threshold(image,200,255, cv2.THRESH_BINARY)
image = cv2.GaussianBlur(image,(11,11),0)
image = cv2.medianBlur(image,9)
and adjust the parameters to improve the results.
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
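Since the best mode depends on the image, one way to "change parameters" is to simply try several modes and compare what comes back. Here is a minimal sketch, assuming pytesseract is installed; `try_psm_modes` and its `ocr` parameter are hypothetical names, and the default list of modes is just a guess at the ones most relevant to a short line of text.

```python
def try_psm_modes(image, modes=(6, 7, 8, 10, 13), ocr=None):
    """Run OCR once per page segmentation mode and collect the results (hypothetical helper)."""
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract installed
        import pytesseract
        ocr = pytesseract.image_to_string
    results = {}
    for psm in modes:
        results[psm] = ocr(image, config=f"--psm {psm}").strip()
    return results


# Usage (needs a preprocessed image and a working Tesseract install):
# from PIL import Image
# for psm, text in try_psm_modes(Image.open("temp2.jpg")).items():
#     print(psm, repr(text))
```

The `ocr` argument is there only so that a different callable can be swapped in; by default it falls back to pytesseract.image_to_string.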