Convert a scanned PDF to text in Python

Question

Asked by Michal

I have a scanned PDF file and I am trying to extract text from it. I tried to use pypdfocr to run OCR on it, but I get this error:

"could not found ghostscript in the usual place"

After searching, I found this solution, "Linking Ghostscript to pypdfocr in Windows Platform", and I tried downloading Ghostscript and adding it to an environment variable, but I still get the same error.
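
One way to check whether the environment variable change actually reached Python is to ask the interpreter to look up the Ghostscript executable. The following is only a diagnostic sketch for Python 3: `find_ghostscript` and `GHOSTSCRIPT_NAMES` are hypothetical names introduced here, and pypdfocr may search fixed install directories rather than PATH, so a hit here does not guarantee that pypdfocr will find Ghostscript.

```python
import shutil

# Executable names Ghostscript is commonly installed under: the Windows
# builds ship console binaries named gswin64c / gswin32c, Unix builds ship gs.
GHOSTSCRIPT_NAMES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(names=GHOSTSCRIPT_NAMES):
    """Return the full path of the first Ghostscript binary on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_ghostscript()
if location is None:
    print("Ghostscript is not on PATH for this Python process")
else:
    print("Ghostscript found at", location)
```

Environment variable changes only reach processes started afterwards, so the terminal or IDE has to be restarted before running the check.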

How can I search for text in my scanned PDF file using Python?

Thanks.

Edit: here is my code sample:

import os
import sys
import re
import json
import shutil
import glob
from pypdfocr import pypdfocr_gs
from pypdfocr import pypdfocr_tesseract 
from PIL import Image

path = PATH_TO_MY_SCANNED_PDF
mainL = []
kk = {}


def new_init(self, kk):
    self.lang = 'heb'   
    self.binary = "tesseract"
    self.msgs = {
            'TS_MISSING': """ 
                Could not execute %s
                Please make sure you have Tesseract installed correctly
                """ % self.binary,
            'TS_VERSION':'Tesseract version is too old',
            'TS_img_MISSING':'Cannot find specified tiff file',
            'TS_FAILED': 'Tesseract-OCR execution failed!',
        }

pypdfocr_tesseract.PyTesseract.__init__ = new_init  

wow = pypdfocr_gs.PyGs(kk)
tt = pypdfocr_tesseract.PyTesseract(kk)


def secFile(filename,oldfilename):
    wow.make_img_from_pdf(filename)


    files = glob.glob("X:/e206333106/ocr-114/balagan/" + '*.jpg')  
    for file in files:
        im = Image.open(file)
        im.save(file + ".tiff") 

    files = glob.glob("PATH" + '*.tiff')  
    for file in files:
        tt.make_hocr_from_pnm(file)
    pdftxt = ""    
    files = glob.glob("PATH" + '*.html') 
    for file in files:
        with open(file) as myfile:
            pdftxt = pdftxt + "#" + "".join(line.rstrip() for line in myfile)
    findNum(pdftxt,oldfilename)

    folder ="PATH"

    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
        except Exception, e:
            print e

def pdf2ocr(filename):
    pdffile = filename
    os.system('pypdfocr -l heb ' + pdffile)

def ocr2txt(filename):  
    pdffile = filename


    output1 = pdffile.replace(".pdf","_ocr.txt")
    output1 = "PATH" + os.path.basename(output1)

    input1 = pdffile.replace(".pdf","_ocr.pdf")

    os.system("pdf2txt -o " + output1 + " " + input1)

    with open(output1) as myfile:
        pdftxt="".join(line.rstrip() for line in myfile)
    findNum(pdftxt,filename)


def findNum(pdftxt,pdffile):
    l = re.findall(r'\b\d+\b', pdftxt)


    output = open('PATH' + os.path.basename(pdffile) + '.txt', 'w')
    for i in l:
        output.write(",")
        output.write(i)
    output.close()    

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

i = 0     
files = glob.glob(path + '\\*.pdf')
print path  
print files 
for file in files:
    if file.endswith(".pdf"):
        if is_ascii(file):
            print file
            pdf2ocr(file)    
            ocr2txt(file)
        else:
            newname = "PATH" + str(i) + ".pdf"
            shutil.copyfile(file, newname)
            print newname
            secFile(newname,file)
        i = i + 1

files = glob.glob(path + '\\' + '*_ocr.pdf')

for file in files:
    print file
    shutil.copyfile(file, "PATH" + os.path.basename(file))
    os.remove(file)

Answer 1

Accepted answer by ghovat

Take a look at this library: https://pypi.python.org/pypi/pypdfocr. Note, however, that a PDF file can also contain images. You may be able to analyse the page content streams. Some scanners break a single scanned page up into several images, so you won't get the text with Ghostscript.

Answer 2

Answered by TRINADH NAGUBADI

Take a look at my code; it worked for me.

import io

import pytesseract
from PIL import Image
from wand.image import Image as wi


def Get_text_from_image(pdf_path):
    # Render every page of the PDF at 300 dpi and convert the pages to JPEG.
    pdf = wi(filename=pdf_path, resolution=300)
    pdfImg = pdf.convert('jpeg')
    imgBlobs = []
    extracted_text = []

    # Serialize each page into an in-memory JPEG blob.
    for img in pdfImg.sequence:
        page = wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))

    # Run Tesseract OCR on each page image.
    for imgBlob in imgBlobs:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang='eng')
        extracted_text.append(text)

    return extracted_text

I ran into an ImageMagick policy error, which I fixed by editing /etc/ImageMagick-6/policy.xml and changing the rights for the PDF line to "read|write":

Open a terminal and change to the ImageMagick configuration directory:

cd /etc/ImageMagick-6
nano policy.xml

In policy.xml, change

<policy domain="coder" rights="read" pattern="PDF" />

to

<policy domain="coder" rights="read|write" pattern="PDF" />

then save the file and exit the editor.
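
To confirm which rights the PDF coder currently has, the policy file can also be inspected from Python. This is a minimal sketch using only the standard library: `pdf_coder_rights` is a hypothetical helper, and `SAMPLE` is made-up content standing in for the real policy.xml.

```python
import xml.etree.ElementTree as ET


def pdf_coder_rights(policy_xml):
    """Return the rights strings of every coder policy whose pattern mentions PDF."""
    root = ET.fromstring(policy_xml)
    rights = []
    for policy in root.iter("policy"):
        if policy.get("domain") != "coder":
            continue
        # A pattern may be a single coder ("PDF") or a set such as "{PS,PDF,XPS}".
        coders = policy.get("pattern", "").strip("{}").split(",")
        if "PDF" in (c.strip().upper() for c in coders):
            rights.append(policy.get("rights", ""))
    return rights


# Made-up sample standing in for the contents of the real policy file.
SAMPLE = """<policymap>
  <policy domain="coder" rights="read" pattern="PDF" />
  <policy domain="coder" rights="none" pattern="PS" />
</policymap>"""

print(pdf_coder_rights(SAMPLE))
```

If the returned rights do not include "read", wand will most likely refuse to open the PDF until the policy line is changed.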

When I was extracting text from the PDF page images I ran into some issues; please see the links below:

https://stackoverflow.com/questions/52699608/wand-policy-error-error-constitute-c-readimage-412

https://stackoverflow.com/questions/52861946/imagemagick-not-authorized-to-convert-pdf-to-an-image

To increase the memory limit, see the links below:
https://github.com/phw/peek/issues/112
https://github.com/ImageMagick/ImageMagick/issues/396

Answer 3

Answered by Ali

PyPDF2 is a Python library built as a PDF toolkit. It is capable of:

Extracting document information (title, author, …)
Splitting documents page by page
Merging documents page by page
Cropping pages
Merging multiple pages into a single page
Encrypting and decrypting PDF files
and more!

To install PyPDF2, run the following command from the command line:

pip install PyPDF2

Code:

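import PyPDF2

# Uses the PdfReader API; the older PdfFileReader, numPages, getPage and
# extractText names were removed in PyPDF2 3.0.
with open('myPdf.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)

    # Number of pages in the document.
    print(len(pdfReader.pages))

    # Extract the text layer of the first page.
    pageObj = pdfReader.pages[0]
    print(pageObj.extract_text())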
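
Note that PyPDF2 only reads text that is already embedded in the PDF. A scanned page is usually just an image, so the extraction call above tends to return an empty or near-empty string for it. A rough way to decide which pages still need OCR is sketched below; `pages_needing_ocr` and the `min_chars` threshold are hypothetical choices made for illustration, not part of PyPDF2.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the indexes of pages whose extracted text is too short to be a real text layer."""
    needing = []
    for index, text in enumerate(page_texts):
        # Image-only pages may yield None, an empty string, or only whitespace.
        cleaned = (text or "").strip()
        if len(cleaned) < min_chars:
            needing.append(index)
    return needing


# Stand-in for the per-page strings returned by the extraction call above.
sample_texts = ["Invoice 2041, total due 118.50 EUR", "", "   \n", None]
print(pages_needing_ocr(sample_texts))
```

Pages flagged this way can then be passed to an OCR tool such as Tesseract, while pages that already have a text layer can be read directly.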

Answer 4

Answered by DougR

Convert the PDFs to images, use pytesseract to do the OCR, and export each page of each PDF to a text file.

Install these:

conda install -c conda-forge pytesseract
conda install -c conda-forge tesseract
pip install pdf2image

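import glob

import pytesseract
from pdf2image import convert_from_path

pdfs = glob.glob(r"yourPath\*.pdf")

for pdf_path in pdfs:
    # Render each page of the PDF to a PIL image at 500 dpi.
    pages = convert_from_path(pdf_path, 500)

    for pageNum, imgBlob in enumerate(pages):
        # OCR the page image with the English language model.
        text = pytesseract.image_to_string(imgBlob, lang='eng')

        # Write one UTF-8 text file per page, e.g. name_page0.txt.
        with open(f'{pdf_path[:-4]}_page{pageNum}.txt', 'w', encoding='utf-8') as the_file:
            the_file.write(text)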
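
Once the per-page text files exist, searching them is a plain text problem. The sketch below assumes the `_page<N>.txt` naming used above and a folder that holds those files; `search_pages` is a hypothetical helper, not part of pytesseract or pdf2image.

```python
import glob
import os
import re


def search_pages(folder, pattern):
    """Yield (file name, line number, line) for every line matching pattern."""
    regex = re.compile(pattern)
    for path in sorted(glob.glob(os.path.join(folder, "*_page*.txt"))):
        # errors="replace" keeps the search going if a file was written
        # with a different encoding than UTF-8.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield os.path.basename(path), line_number, line.rstrip("\n")
```

For example, `for name, number, line in search_pages("yourPath", r"\b\d+\b"): print(name, number, line)` would list every line containing a standalone number, the same pattern the question's findNum function looks for.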

Answer 5

Answered by E. Alex

You can use OpenCV for Python. There are a lot of examples of text detection.
