Python: extract a page from a PDF as a JPEG
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/46184239/
Extract a page from a pdf as a jpeg
Asked by vishvAs vAsuki
In Python code, how can I efficiently save a certain page of a PDF as a JPEG file? (Use case: I have a Python Flask web server where PDFs will be uploaded and a JPEG corresponding to each page is stored.)
This solution is close, but the problem is that it does not convert the entire page to JPEG.
Answer by Keval Dave
The pdf2image library can be used.
You can install it with:
pip install pdf2image
Once installed, you can use the following code to get the images.
from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)
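Saving the pages in JPEG format:
for i, page in enumerate(pages, start=1):
    # Give each page its own file name; a fixed name would be overwritten on every iteration.
    page.save('out%d.jpg' % i, 'JPEG')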
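Since the question asks for one particular page, here is a minimal sketch that renders only that page. It relies on the `first_page` and `last_page` parameters of `convert_from_path` (both 1-based). The helper names `page_filename` and `save_page_as_jpeg`, the naming scheme, and the 200 dpi default are illustrative assumptions, not part of the answer above.

```python
from pathlib import Path


def page_filename(pdf_path, page_number):
    """Build an output name such as 'report-page3.jpg' (hypothetical naming scheme)."""
    return "%s-page%d.jpg" % (Path(pdf_path).stem, page_number)


def save_page_as_jpeg(pdf_path, page_number, dpi=200):
    """Render one page (1-based) and save it in the current working directory."""
    # Imported lazily so the naming helper works without pdf2image installed.
    from pdf2image import convert_from_path

    # Restricting first_page/last_page avoids rendering the whole document.
    pages = convert_from_path(
        pdf_path, dpi, first_page=page_number, last_page=page_number
    )
    out_name = page_filename(pdf_path, page_number)
    pages[0].save(out_name, "JPEG")
    return out_name
```

For example, `save_page_as_jpeg("report.pdf", 3)` would be expected to write `report-page3.jpg`, provided poppler is installed as described in this answer.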
Edit: the GitHub repo for pdf2image also mentions that it uses pdftoppm and that it requires other installations:
pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (tested on Ubuntu and Archlinux); if it's not, run sudo apt install poppler-utils.
You can install the latest version under Windows using Anaconda:
conda install -c conda-forge poppler
Note: Windows versions up to 0.67 are available at http://blog.alivate.com.au/poppler-windows/ but 0.68 was released in Aug 2018, so you will not be getting the latest features or bug fixes.
Answer by JJPty
I found this simple solution, PyMuPDF, which outputs to a PNG file. Note that the library is imported as "fitz", a historical name for the rendering engine it uses.
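import fitz  # PyMuPDF

pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.load_page(0)  # page index, 0-based (loadPage in older PyMuPDF releases)
pix = page.get_pixmap()  # getPixmap in older releases
output = "outfile.png"
pix.save(output)  # writePNG in older releases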
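The snippet above renders at PyMuPDF's default resolution and writes a PNG. As a hedged sketch of how one might get a JPEG at a chosen resolution instead, the code below scales the page with a `fitz.Matrix` and hands the pixel data to Pillow for JPEG encoding. It uses the snake_case method names of recent PyMuPDF releases; the function names, the 150 dpi default, and the use of Pillow are assumptions for illustration.

```python
def zoom_for_dpi(dpi):
    """PDF user space is 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def render_page_as_jpeg(pdf_path, page_index, out_path, dpi=150):
    """Render one page (0-based index) and write it as a JPEG."""
    # Imported lazily so zoom_for_dpi works without the third-party packages.
    import fitz  # PyMuPDF
    from PIL import Image  # Pillow

    zoom = zoom_for_dpi(dpi)
    doc = fitz.open(pdf_path)
    try:
        page = doc.load_page(page_index)
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        # With the default alpha=False the samples are packed RGB bytes.
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        image.save(out_path, "JPEG")
    finally:
        doc.close()
    return out_path
```

A call such as `render_page_as_jpeg("infile.pdf", 0, "outfile.jpg")` would render the first page at roughly twice the default size.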
Answer by Basj
The Python library pdf2image (used in the other answer) in fact doesn't do much more than launch pdftoppm with subprocess.Popen, so here is a short version that does it directly:
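import subprocess

# Adjust both paths to your own system.
PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"
# Writes one PNG per page, named out-<page number>.png, to the current directory.
subprocess.Popen('"%s" -png "%s" out' % (PDFTOPPMPATH, PDFFILE))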
Here is the Windows installation link for pdftoppm (contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/
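The direct call converts every page to PNG. For the single-page JPEG case from the question, a sketch along these lines might work: it passes pdftoppm's `-f`/`-l` options to select the page, `-jpeg` for the format, `-r` for the resolution, and `-singlefile` so that no page number is appended to the output name. The function names and the 150 dpi default are assumptions for illustration.

```python
import subprocess


def build_pdftoppm_command(pdftoppm_path, pdf_file, page_number, out_prefix, dpi=150):
    """Return the argument list for rendering one page (1-based) as a JPEG."""
    page = str(page_number)
    return [
        pdftoppm_path,
        "-jpeg",          # write JPEG instead of the default PPM
        "-r", str(dpi),   # resolution in DPI
        "-f", page,       # first page to convert
        "-l", page,       # last page to convert
        "-singlefile",    # write out_prefix + ".jpg" without a page suffix
        pdf_file,
        out_prefix,
    ]


def convert_page(pdftoppm_path, pdf_file, page_number, out_prefix, dpi=150):
    """Run pdftoppm and wait for it; raises CalledProcessError on failure."""
    command = build_pdftoppm_command(
        pdftoppm_path, pdf_file, page_number, out_prefix, dpi
    )
    subprocess.run(command, check=True)
    return out_prefix + ".jpg"
```

Passing the arguments as a list avoids manual quoting of paths that contain spaces, and `subprocess.run(..., check=True)` waits for the conversion to finish, which `Popen` alone does not.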
Answer by DevB2F
There is no need to install Poppler on your OS. This will work:
pip install Wand
from wand.image import Image

f = "somefile.pdf"
with Image(filename=f, resolution=120) as source:
    for i, image in enumerate(source.sequence):
        # somefile.pdf -> somefile1.jpeg, somefile2.jpeg, ...
        newfilename = f[:-4] + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)
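The loop above converts every page. To read just one page, a possible sketch uses ImageMagick's `[n]` suffix on the file name (0-based). Wand is a binding to ImageMagick, so this assumes ImageMagick is installed with a PDF delegate such as Ghostscript available. The function names here are hypothetical.

```python
def page_spec(pdf_path, page_index):
    """ImageMagick reads a single page when the name ends in '[n]' (0-based)."""
    return "%s[%d]" % (pdf_path, page_index)


def wand_save_page(pdf_path, page_index, out_path, resolution=120):
    """Render one page with Wand and save it as a JPEG."""
    # Imported lazily so page_spec works without Wand installed.
    from wand.image import Image

    with Image(filename=page_spec(pdf_path, page_index), resolution=resolution) as img:
        img.format = "jpeg"
        img.save(filename=out_path)
    return out_path
```

Reading only the requested page keeps memory use lower than loading the full sequence and discarding most of it.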
Answer by photek1944
@gaurwraith, install poppler for Windows and use pdftoppm.exe as follows:
Download the zip file with Poppler's latest binaries/DLLs from http://blog.alivate.com.au/poppler-windows/ and unzip it to a new folder in your program files folder. For example: "C:\Program Files (x86)\Poppler".
Add "C:\Program Files (x86)\Poppler\poppler-0.68.0\bin" to your SYSTEM PATH environment variable.
From the command line, install the pdf2image module: "pip install pdf2image".
- Or alternatively, directly execute pdftoppm.exe from your code using Python's subprocess module, as explained by user Basj.
@vishvAs vAsuki, this code should generate the JPGs you want through the subprocess module for all pages of one or more PDFs in a given folder:
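import os, subprocess

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)
pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        # Use the PDF's own name as the output prefix so that different PDFs
        # do not overwrite each other's pages; the argument list also copes
        # with spaces in file names.
        subprocess.Popen([pdftoppm_path, "-jpeg", pdf_file, pdf_file[:-4]])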
Or using the pdf2image module:
import os
from pdf2image import convert_from_path

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_file = pdf_file[:-4]
        # enumerate yields the page index directly instead of searching the list for it
        for i, page in enumerate(pages):
            page.save("%s-page%d.jpg" % (pdf_file, i), "JPEG")
Answer by duck
There is a utility called pdf2jpg which can be used to convert a PDF to images.
You can find the code here: https://github.com/pankajr141/pdf2jpg
from pdf2jpg import pdf2jpg
inputpath = r"D:\inputdir\pdf1.pdf"
outputpath = r"D:\outputdir"
# To convert single page
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1")
print(result)
# To convert multiple pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1,0,3")
print(result)
# to convert all pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="ALL")
print(result)
Answer by Keval Dave
GhostScript performs much faster than Poppler on a Linux-based system.
The following code converts a PDF page to an image:
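import os
import subprocess

RESOLUTION = 300  # DPI; not defined in the original snippet, so this value is an assumption

def get_image_page(pdf_file, out_file, page_num):
    page = str(page_num + 1)  # Ghostscript page numbers start at 1
    command = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r" + str(RESOLUTION), "-dPDFFitPage",
               "-sOutputFile=" + out_file, "-dFirstPage=" + page, "-dLastPage=" + page,
               pdf_file]
    # Discard Ghostscript's output; the with block closes the null device afterwards.
    with open(os.devnull, 'w') as f_null:
        subprocess.call(command, stdout=f_null, stderr=subprocess.STDOUT)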
GhostScript can be installed on macOS using brew install ghostscript.
If it is not already installed on your system, installation information for other platforms can be found here.
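The function in this answer uses the `png16m` device, so it produces PNG output. Since the question asks for JPEG, a hedged variant could switch to Ghostscript's `jpeg` device and set the quality with `-dJPEGQ`. The function names and the default resolution and quality values are assumptions for illustration.

```python
import subprocess


def build_gs_command(pdf_file, out_file, page_num, resolution=150, quality=90):
    """Argument list for rendering one page (0-based page_num) to a JPEG file."""
    page = str(page_num + 1)  # Ghostscript counts pages from 1
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=jpeg",             # JPEG output device
        "-dJPEGQ=" + str(quality),   # JPEG quality, 0 to 100
        "-r" + str(resolution),      # resolution in DPI
        "-dFirstPage=" + page,
        "-dLastPage=" + page,
        "-sOutputFile=" + out_file,
        pdf_file,
    ]


def get_jpeg_page(pdf_file, out_file, page_num, resolution=150, quality=90):
    """Run Ghostscript quietly; raises CalledProcessError if it fails."""
    command = build_gs_command(pdf_file, out_file, page_num, resolution, quality)
    subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT, check=True
    )
    return out_file
```

Using `check=True` surfaces conversion failures that `subprocess.call` would only report through its return value.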
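Answer by Saiprasad Bhatwadekar
import glob
import os
from pdf2image import convert_from_path

pdf_dir = glob.glob(r'G:\personal\pdf\*')  # your pdf folder path
img_dir = "G:\\personal\\img\\"  # your dest img path
for pdf_ in pdf_dir:
    pages = convert_from_path(pdf_, 500)
    # base file name without the directory or the ".pdf" extension
    name = os.path.splitext(os.path.basename(pdf_))[0]
    for i, page in enumerate(pages, start=1):
        # number the pages so they do not overwrite one another
        page.save(img_dir + "%s-%d.jpg" % (name, i), 'JPEG')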
Answer by Robert
I use a (maybe) much simpler option than pdf2image:
cd "$dir"
for f in *.pdf
do
    if [ -f "${f}" ]; then
        n=$(echo "$f" | cut -f1 -d'.')
        pdftoppm -scale-to 1440 -png "$f" "$conv/$n"
        rm "$f"
        mv "$conv"/*.png "$dir"
    fi
done
This is a small part of a bash script that runs in a loop on a narrowcasting device. It checks every 5 seconds for newly added PDF files and processes all of them. This is for a demo device; in the end, the conversion will be done on a remote server. It converts to .PNG now, but .JPG is possible too.
This conversion, together with transitions on A4 format, displaying a video, two smooth scrolling texts and a logo (with transitions in three versions), pushes the Pi3 to almost 4x 100% CPU load ;-)
Answer by moo5e
Here is a solution which requires no additional libraries and is very fast. It works by pulling out JPEG streams that are already embedded in the PDF. It was found at: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html
I have added the code to a function to make it more convenient.
def convert(filepath):
    with open(filepath, "rb") as file:
        pdf = file.read()

    startmark = b"\xff\xd8"  # JPEG start-of-image marker
    startfix = 0
    endmark = b"\xff\xd9"  # JPEG end-of-image marker
    endfix = 2
    i = 0
    njpg = 0
    newfiles = []
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")
        istart += startfix
        iend += endfix
        jpg = pdf[istart:iend]
        # Number the output files so that several JPEGs in one PDF do not overwrite each other.
        newfile = "{}{}.jpg".format(filepath[:-4], njpg)
        with open(newfile, "wb") as jpgfile:
            jpgfile.write(jpg)
        newfiles.append(newfile)
        njpg += 1
        i = iend
    return newfiles
Call convert with the PDF path as the argument, and the function will write its .jpg output to the same directory as the PDF.
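To see how the marker scan behaves without touching the disk, here is a minimal in-memory sketch of the same search, run on a synthetic byte string. The function name `extract_jpegs` and the fake PDF bytes are illustrative assumptions. Note that the scan only finds JPEG data already embedded in the file, so a PDF whose pages consist of text or vector graphics would yield nothing.

```python
def extract_jpegs(data):
    """Return the raw bytes of every embedded JPEG stream found in `data`."""
    jpgs = []
    i = 0
    while True:
        istream = data.find(b"stream", i)
        if istream < 0:
            break
        # A JPEG stream starts with FF D8 shortly after the "stream" keyword.
        istart = data.find(b"\xff\xd8", istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = data.find(b"endstream", istart)
        if iend < 0:
            raise ValueError("Didn't find end of stream!")
        # The end-of-image marker FF D9 sits just before "endstream".
        iend = data.find(b"\xff\xd9", iend - 20)
        if iend < 0:
            raise ValueError("Didn't find end of JPG!")
        iend += 2  # include the two marker bytes
        jpgs.append(data[istart:iend])
        i = iend
    return jpgs


# Synthetic input: not a valid PDF, just enough structure for the scan.
fake_jpg = b"\xff\xd8" + b"image-data" + b"\xff\xd9"
fake_pdf = b"%PDF-1.4\n1 0 obj\nstream\n" + fake_jpg + b"\nendstream\nendobj\n"
found = extract_jpegs(fake_pdf)
```

Returning the bytes instead of writing files leaves the choice of output names to the caller.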