Python: extract a page from a PDF as a JPEG

Question

Asked by vishvAs vAsuki

In Python code, how can I efficiently save a certain page of a PDF as a JPEG file? (Use case: I have a Python Flask web server where PDFs will be uploaded and a JPEG corresponding to each page is stored.)

This solution is close, but the problem is that it does not convert the entire page to JPEG.

Answer 1

Answered by Keval Dave

The pdf2image library can be used.

You can install it with pip:

pip install pdf2image

Once installed, you can use the following code to get the images.

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

Saving the pages in JPEG format:

# Give each page its own file name so later pages do not overwrite earlier ones
for i, page in enumerate(pages):
    page.save('out_%d.jpg' % i, 'JPEG')
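Since the question asks for one specific page, a minimal sketch along the following lines may be cheaper than converting the whole document. It assumes pdf2image's first_page and last_page keyword arguments (1-based) and a working Poppler installation; the names page_jpeg_name and save_page_as_jpeg, and the naming scheme, are made up for illustration.

```python
import os


def page_jpeg_name(pdf_path, page_number):
    """Build an output name such as 'report-page3.jpg' (hypothetical naming scheme)."""
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    return "%s-page%d.jpg" % (base, page_number)


def save_page_as_jpeg(pdf_path, page_number, out_dir=".", dpi=200):
    """Render only the requested 1-based page and save it as a JPEG."""
    # Imported here so the naming helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    if not pages:
        raise ValueError("page %d not found in %s" % (page_number, pdf_path))
    out_path = os.path.join(out_dir, page_jpeg_name(pdf_path, page_number))
    pages[0].save(out_path, "JPEG")
    return out_path
```

For example, save_page_as_jpeg('pdf_file', 3) would write only the third page instead of rendering all of them first.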

Edit: the GitHub repo for pdf2image also mentions that it uses pdftoppm and that it requires other installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (tested on Ubuntu and Archlinux). If it's not, run sudo apt install poppler-utils.

You can install the latest version on Windows using Anaconda:

conda install -c conda-forge poppler

Note: Windows versions up to 0.67 are available at http://blog.alivate.com.au/poppler-windows/ but 0.68 was released in Aug 2018, so you will not get the latest features or bug fixes there.
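Because pdf2image depends on the external pdftoppm binary, it may help to check that the binary is reachable before converting. A small sketch using only the standard library; the function name find_pdftoppm is an assumption for illustration.

```python
import shutil


def find_pdftoppm(executable="pdftoppm"):
    """Return the full path of the Poppler binary, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_pdftoppm()
    if path is None:
        print("pdftoppm not found; install Poppler or add its bin folder to PATH")
    else:
        print("using", path)
```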

Answer 2

Answered by JJPty

I found this simple solution using PyMuPDF, which writes the page to a PNG file. Note that the library is imported as "fitz", a historical name for the rendering engine it uses.

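    import fitz  # PyMuPDF

    pdffile = "infile.pdf"
    doc = fitz.open(pdffile)
    page = doc.load_page(0)  # 0-based page number (loadPage in older releases)
    pix = page.get_pixmap()  # getPixmap in older releases
    output = "outfile.png"
    pix.save(output)  # writePNG in older releases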
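The snippet above renders at the default 72 dpi and writes PNG, while the question asks for JPEG. One possible variation, assuming a recent PyMuPDF (snake_case method names) and that Pillow is installed to do the JPEG encoding; zoom_for_dpi and render_page_to_jpeg are hypothetical helper names.

```python
def zoom_for_dpi(dpi):
    """PDF user space is 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def render_page_to_jpeg(pdf_path, page_index, out_path, dpi=150, quality=90):
    """Render one 0-based page with PyMuPDF and save it as a JPEG via Pillow."""
    # Imported lazily so zoom_for_dpi can be used without these packages.
    import fitz  # PyMuPDF
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    doc = fitz.open(pdf_path)
    try:
        page = doc.load_page(page_index)
        # The default pixmap is RGB without an alpha channel.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        img.save(out_path, "JPEG", quality=quality)
    finally:
        doc.close()
    return out_path
```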

Answer 3

Answered by Basj

The Python library pdf2image (used in the other answer) in fact doesn't do much more than launch pdftoppm with subprocess.Popen, so here is a short version that does it directly:

PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"

import subprocess
subprocess.Popen('"%s" -png "%s" out' % (PDFTOPPMPATH, PDFFILE))

Here is the Windows installation link for pdftoppm (contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/
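To convert a single page this way, pdftoppm accepts -f and -l for the first and last page, -r for the resolution and -jpeg for the output format. A sketch under the assumption that pdftoppm is on PATH (otherwise pass the full path as exe, as in the snippet above); build_pdftoppm_cmd and page_to_jpeg are illustrative names.

```python
import subprocess


def build_pdftoppm_cmd(pdf_path, out_prefix, page_number, dpi=150, exe="pdftoppm"):
    """Return the argument list that renders one 1-based page to JPEG."""
    page = str(page_number)
    return [exe, "-jpeg", "-r", str(dpi), "-f", page, "-l", page, pdf_path, out_prefix]


def page_to_jpeg(pdf_path, out_prefix, page_number, dpi=150, exe="pdftoppm"):
    """Run pdftoppm and wait for it; raises CalledProcessError on failure."""
    cmd = build_pdftoppm_cmd(pdf_path, out_prefix, page_number, dpi, exe)
    # A list of arguments avoids quoting problems with spaces in paths.
    subprocess.run(cmd, check=True)
```

pdftoppm appends the page number to the prefix, so the result is named something like out-3.jpg; the amount of zero padding depends on the page count of the document.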

Answer 4

Answered by DevB2F

There is no need to install Poppler on your OS. This will work:

pip install Wand

from wand.image import Image

f = "somefile.pdf"
with Image(filename=f, resolution=120) as source:
    for i, image in enumerate(source.sequence):
        newfilename = f[:-4] + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)
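Wand is a binding to ImageMagick, which in turn typically relies on Ghostscript to read PDFs, so those still need to be present. If only one page is wanted, a sketch like the following may avoid loading the whole document. It assumes that Wand passes ImageMagick's frame selector syntax ("file.pdf[0]") through in the filename; page_spec and wand_page_to_jpeg are hypothetical names.

```python
def page_spec(pdf_path, page_index):
    """ImageMagick frame selector: 'file.pdf[0]' refers to the first page."""
    return "%s[%d]" % (pdf_path, page_index)


def wand_page_to_jpeg(pdf_path, page_index, out_path, resolution=120):
    """Read one 0-based page with Wand and write it as a JPEG."""
    from wand.image import Image  # lazy import; requires Wand and ImageMagick

    with Image(filename=page_spec(pdf_path, page_index), resolution=resolution) as img:
        img.format = "jpeg"
        img.save(filename=out_path)
    return out_path
```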

Answer 5

Answered by photek1944

@gaurwraith, install poppler for Windows and use pdftoppm.exe as follows:

Download the zip file with Poppler's latest binaries/DLLs from http://blog.alivate.com.au/poppler-windows/ and unzip it to a new folder in your Program Files folder. For example: "C:\Program Files (x86)\Poppler".
Add "C:\Program Files (x86)\Poppler\poppler-0.68.0\bin" to your system PATH environment variable.
From the command line, install the pdf2image module: "pip install pdf2image".
Alternatively, execute pdftoppm.exe directly from your code using Python's subprocess module, as explained by user Basj.

@vishvAs vAsuki, this code should generate the JPEGs you want through the subprocess module, for all pages of one or more PDFs in a given folder:

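import os
import subprocess

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"

for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        # Use the PDF's own name as the output prefix so files from different PDFs do not collide.
        # Passing a list of arguments avoids quoting problems with spaces in file names.
        subprocess.run([pdftoppm_path, "-jpeg", pdf_file, pdf_file[:-4]], check=True)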

Or using the pdf2image module:

import os
from pdf2image import convert_from_path

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_name = pdf_file[:-4]
        # enumerate gives the page index directly (pages.index(page) rescans the list)
        for i, page in enumerate(pages):
            page.save("%s-page%d.jpg" % (pdf_name, i), "JPEG")
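For large PDFs, holding every rendered page in memory can be expensive. pdf2image can instead write the files itself when given output_folder and fmt. The sketch below assumes those keyword arguments together with output_file and paths_only; list_pdfs and convert_folder are hypothetical names.

```python
import os


def list_pdfs(names):
    """Keep only names ending in .pdf, case-insensitively, in sorted order."""
    return sorted(n for n in names if n.lower().endswith(".pdf"))


def convert_folder(pdf_dir, out_dir, dpi=300):
    """Convert every PDF in pdf_dir, letting pdf2image write JPEGs into out_dir."""
    from pdf2image import convert_from_path  # lazy import; needs Poppler installed

    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name in list_pdfs(os.listdir(pdf_dir)):
        paths = convert_from_path(
            os.path.join(pdf_dir, name),
            dpi=dpi,
            output_folder=out_dir,
            fmt="jpeg",
            output_file=os.path.splitext(name)[0],  # prefix for generated names
            paths_only=True,  # return file paths instead of loaded images
        )
        written.extend(paths)
    return written
```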

Answer 6

Answered by duck

There is a utility called pdf2jpg which can be used to convert a PDF to images.

You can find the code here: https://github.com/pankajr141/pdf2jpg

from pdf2jpg import pdf2jpg
inputpath = r"D:\inputdir\pdf1.pdf"
outputpath = r"D:\outputdir"
# To convert single page
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1")
print(result)

# To convert multiple pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1,0,3")
print(result)

# to convert all pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="ALL")
print(result)

Answer 7

Answered by Keval Dave

GhostScript performs much faster than Poppler on a Linux-based system.

The following code converts a PDF page to an image.

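import subprocess

RESOLUTION = 300  # output resolution in DPI; assumed value, not given in the original answer

def get_image_page(pdf_file, out_file, page_num):
    # Ghostscript counts pages from 1, while page_num is 0-based
    page = str(page_num + 1)
    command = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r" + str(RESOLUTION), "-dPDFFitPage",
               "-sOutputFile=" + out_file, "-dFirstPage=" + page, "-dLastPage=" + page,
               pdf_file]
    # Discard Ghostscript's console output
    subprocess.call(command, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)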

If it is not already installed on your system, GhostScript can be installed on macOS using brew install ghostscript. Installation information for other platforms can be found in the Ghostscript documentation.
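The function above writes PNG (png16m device). Since the question asks for JPEG, a variation could use Ghostscript's jpeg device with -dJPEGQ for the quality. This is a sketch under the assumption that the executable is called gs (on Windows it is usually gswin64c); build_gs_jpeg_cmd and gs_page_to_jpeg are illustrative names.

```python
import subprocess


def build_gs_jpeg_cmd(pdf_file, out_file, page_num, dpi=150, quality=90):
    """Argument list rendering one 0-based page to JPEG with Ghostscript."""
    page = str(page_num + 1)  # Ghostscript counts pages from 1
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=jpeg", "-dJPEGQ=%d" % quality, "-r%d" % dpi,
        "-dFirstPage=" + page, "-dLastPage=" + page,
        "-sOutputFile=" + out_file, pdf_file,
    ]


def gs_page_to_jpeg(pdf_file, out_file, page_num, dpi=150, quality=90):
    """Run Ghostscript quietly; raises CalledProcessError if it fails."""
    cmd = build_gs_jpeg_cmd(pdf_file, out_file, page_num, dpi, quality)
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
```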

Answer 8

Answered by Saiprasad Bhatwadekar

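from pdf2image import convert_from_path
import glob
import os

pdf_paths = glob.glob(r'G:\personal\pdf\*.pdf')  # your pdf folder path
img_dir = 'G:\\personal\\img'                    # your dest img path

for pdf_path in pdf_paths:
    pages = convert_from_path(pdf_path, 500)
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    # Number the pages so that each one gets its own file
    for i, page in enumerate(pages):
        page.save(os.path.join(img_dir, '%s-%d.jpg' % (base, i)), 'JPEG')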

Answer 9

Answered by Robert

I use a (maybe) much simpler option than pdf2image, calling pdftoppm directly from a shell script:

cd "$dir"
for f in *.pdf
do
  if [ -f "${f}" ]; then
    n="${f%.pdf}"  # file name without the .pdf extension
    pdftoppm -scale-to 1440 -png "$f" "$conv/$n"
    rm "$f"
    mv "$conv"/*.png "$dir"
  fi
done

This is a small part of a bash script that runs in a loop on a narrowcasting device. It checks every 5 seconds for added PDF files and processes all of them. This is for a demo device; in the end, converting will be done on a remote server. It converts to .PNG now, but .JPG is possible too.

This converting, together with transitions on A4 format, displaying a video, two smooth scrolling texts and a logo (with transition in three versions), sets the Pi3 to almost 4x 100% CPU load ;-)

Answer 10

Answered by moo5e

Here is a solution which requires no additional libraries and is very fast. It was found at https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html. I have wrapped the code in a function to make it more convenient. Note that it extracts JPEG images already embedded in the PDF rather than rendering pages.

def convert(filepath):
    with open(filepath, "rb") as file:
        pdf = file.read()

    startmark = b"\xff\xd8"
    startfix = 0
    endmark = b"\xff\xd9"
    endfix = 2
    i = 0

    njpg = 0
    newfiles = []
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")

        istart += startfix
        iend += endfix
        jpg = pdf[istart:iend]
        # Number the output files so that each embedded JPEG gets its own file
        newfile = "{}_{}.jpg".format(filepath[:-4], njpg)
        with open(newfile, "wb") as jpgfile:
            jpgfile.write(jpg)
        newfiles.append(newfile)

        njpg += 1
        i = iend

    # Return only after the whole file has been scanned
    return newfiles

Call convert with the PDF path as the argument; the function will create one .jpg file per embedded JPEG in the same directory and return the list of their paths.
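The scan relies on JPEG data starting with the SOI marker FF D8 and ending with the EOI marker FF D9. A tiny sanity check along those lines could be applied to each extracted chunk before it is written; looks_like_jpeg is a hypothetical helper, and it only checks the two markers, not the image contents.

```python
def looks_like_jpeg(data):
    """True if the bytes start with the SOI marker and end with the EOI marker."""
    return len(data) >= 4 and data[:2] == b"\xff\xd8" and data[-2:] == b"\xff\xd9"
```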
