Python: use pytesseract to recognize text from an image

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original URL, and attribute it to the original authors (not me), StackOverflow. Original question: http://stackoverflow.com/questions/37745519/

Date: 2020-08-19 19:53:06  Source: igfitidea

use pytesseract to recognize text from image

Tags: python, image, ocr, pytesser

Asked by Smith John

I need to use pytesseract to extract text from this picture: [image: https://i.stack.imgur.com/HWLay.gif]


and the code:


from PIL import Image, ImageEnhance, ImageFilter
import pytesseract
path = 'pic.gif'
img = Image.open(path)
img = img.convert('RGBA')
pix = img.load()
# If any channel of a pixel is dark (< 102), make it black; otherwise make it white
for y in range(img.size[1]):
    for x in range(img.size[0]):
        if pix[x, y][0] < 102 or pix[x, y][1] < 102 or pix[x, y][2] < 102:
            pix[x, y] = (0, 0, 0, 255)
        else:
            pix[x, y] = (255, 255, 255, 255)
img.convert('RGB').save('temp.jpg')  # JPEG has no alpha channel, so drop it before saving
text = pytesseract.image_to_string(Image.open('temp.jpg'))
# os.remove('temp.jpg')
print(text)

and the "temp.jpg" is enter image description here


Not bad, but the printed result is ",2 WW" rather than the right text "2HHH", so how can I remove those black dots?


Answered by Smith John

Here is my solution:


import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open("temp.jpg") # the second one 
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
text = pytesseract.image_to_string(Image.open('temp2.jpg'))
print(text)

Answered by Dinesh Chandra Kumawat

I have a somewhat different pytesseract approach for our community. Here is my approach:


import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("temp.jpg"), lang='eng',
                        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

print(text)
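
Here --psm 10 tells Tesseract to treat the image as a single character and --oem 3 selects the default OCR engine mode, while tessedit_char_whitelist restricts the output to the digits 0-9. This configuration therefore suits images containing a single digit; for a longer string such as 2HHH you would pick a different page segmentation mode and whitelist.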

Answered by nathancy

To perform OCR on an image, it's important to preprocess the image. Here's a simple approach using OpenCV and Pytesseract OCR. The idea is to obtain a processed image where the text to extract is in black and the background is in white. To do this, we can convert to grayscale, apply a slight Gaussian blur, then Otsu's threshold to obtain a binary image. From here, we can apply morphological operations to remove noise. Finally we invert the image. We perform text extraction using the --psm 6 configuration option to assume a single uniform block of text. Take a look here for more options.




Here's a visualization of each step:


Input image


[image]

Convert to grayscale -> Gaussian blur -> Otsu's threshold


[image]

Notice how there are tiny specks of noise; to remove them we can perform morphological operations.


[image]

Finally we invert the image


[image]

Result from Pytesseract OCR


2HHH

Code


import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening

# Perform text extraction
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('opening', opening)
cv2.imshow('invert', invert)
cv2.waitKey()
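
The pytesseract.pytesseract.tesseract_cmd line is only needed when the Tesseract executable is not on your PATH, as is typical for a default Windows install; on most Linux or macOS setups it can be omitted.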

Answered by SIM

To extract the text directly from the web, you can try the following implementation (making use of the first image):


import io
import requests
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance

response = requests.get('https://i.stack.imgur.com/HWLay.gif')
img = Image.open(io.BytesIO(response.content))
img = img.convert('L')
img = img.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(img)
img = enhancer.enhance(2)
img = img.convert('1')
img.save('image.jpg')
imagetext = pytesseract.image_to_string(img)
print(imagetext)
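
Note that image.jpg is saved only so you can inspect the preprocessed result; image_to_string is called on the in-memory img object, so the save step can be dropped.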

Answered by nishit chittora

Here is my small improvement: removing noise and arbitrary lines within a certain colour range.


import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open(img)  # img is the path of the image
im = im.convert("RGBA")
newimdata = []
datas = im.getdata()

# Keep dark pixels (likely text) and turn everything else white
for item in datas:
    if item[0] < 112 or item[1] < 112 or item[2] < 112:
        newimdata.append(item)
    else:
        newimdata.append((255, 255, 255, 255))  # RGBA image, so use a 4-tuple
im.putdata(newimdata)

im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.jpg')
# Note: older Tesseract 3.x expects '-psm 6' instead of '--psm 6'
text = pytesseract.image_to_string(Image.open('temp2.jpg'),
                                   config='-c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyz --psm 6',
                                   lang='eng')
print(text)

Answered by nexoma

You only need to increase the size of the picture with cv2.resize:


image = cv2.resize(image,(0,0),fx=7,fy=7)

my picture 200x40 -> HZUBS


the same picture resized to 1400x300 -> A 1234 (so this is right)


and then,


retval, image = cv2.threshold(image,200,255, cv2.THRESH_BINARY)
image = cv2.GaussianBlur(image,(11,11),0)
image = cv2.medianBlur(image,9)

and adjust the parameters to improve the results; the available page segmentation modes are listed below, and a combined sketch follows the list.


Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
            bypassing hacks that are Tesseract-specific.
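
Putting the resize, threshold, and blur steps of this answer together, here is a minimal end-to-end sketch. The input file name 'small.png', the 7x scale factor, and the --psm 7 (single text line) mode are assumptions to adjust for your own image:

import cv2
import pytesseract

# 'small.png' is a placeholder path for a low-resolution input image
image = cv2.imread('small.png', cv2.IMREAD_GRAYSCALE)

# Enlarge the image so the characters are big enough for Tesseract
image = cv2.resize(image, (0, 0), fx=7, fy=7, interpolation=cv2.INTER_CUBIC)

# Binarize, then smooth away the artifacts introduced by resizing
retval, image = cv2.threshold(image, 200, 255, cv2.THRESH_BINARY)
image = cv2.GaussianBlur(image, (11, 11), 0)
image = cv2.medianBlur(image, 9)

# --psm 7: treat the image as a single text line (see the modes above)
text = pytesseract.image_to_string(image, config='--psm 7')
print(text)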