Merging PDF files with Python
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/3444645/
Merge PDF files
Asked by Btibert3
Is it possible, using Python, to merge separate PDF files?
Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure.
And I may be pushing my luck, but is it possible to exclude a page that is contained in each of the PDFs (my report generation always creates an extra blank page)?
Accepted answer by Gilles 'SO- stop being evil'
Use pyPdf or its successor PyPDF2:
A Pure-Python library built as a PDF toolkit. It is capable of:
* splitting documents page by page,
* merging documents page by page,
(and much more)
Here's a sample program that works with both versions.
#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    # On Python 3 binary data must go to sys.stdout.buffer; Python 2 has no such attribute.
    pdf_cat(sys.argv[1:], getattr(sys.stdout, 'buffer', sys.stdout))
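The open-everything-first pattern in the sample can also be written with contextlib.ExitStack from the standard library, which closes a variable number of files in one place. This is a sketch and not part of the original answer; with_open_inputs and its action callback are hypothetical names.

```python
from contextlib import ExitStack

def with_open_inputs(paths, action):
    """Open all paths for binary reading, call action(streams), then close them.

    Hypothetical helper: the streams stay open for the whole call to action,
    and are closed afterwards even if action raises.
    """
    with ExitStack() as stack:
        streams = [stack.enter_context(open(path, "rb")) for path in paths]
        return action(streams)
```

Here action would be the place to build the writer, add the pages and write the output, so that the inputs are still open during the write.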
Answer by Martin Thoma
Is it possible, using Python, to merge separate PDF files?
Yes.
The following example merges all files in one folder into a single new PDF file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()
    for pdffile in glob(path + os.sep + '*.pdf'):
        # Compare absolute paths so './merged.pdf' and 'merged.pdf' match.
        if os.path.abspath(pdffile) == os.path.abspath(output_filename):
            continue
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))
    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()
    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")
    args = parser.parse_args()
    merge(args.path, args.output_filename)
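The question also asks about repeating the merge for every folder in a directory. A minimal sketch of that step, using only the standard library, could look like the following. The helper name pdfs_by_folder is hypothetical, and it only collects paths; each folder it returns could then be handed to a merge routine such as the one above.

```python
import os

def pdfs_by_folder(root):
    """Map each folder under root to the sorted PDF paths it contains.

    Hypothetical helper: it only gathers paths and merges nothing.
    """
    groups = {}
    for folder, _dirs, files in os.walk(root):
        pdfs = sorted(
            os.path.join(folder, name)
            for name in files
            if name.lower().endswith(".pdf")
        )
        if pdfs:  # skip folders without any PDF
            groups[folder] = pdfs
    return groups
```

Sorting the names makes the page order predictable, since neither glob nor os.walk promises any particular order.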
Answer by Mark K
This page gives a solution: http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/
Similarly:
from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input, output):
    # Append every page of the input reader to the output writer.
    for page_num in range(input.numPages):
        output.addPage(input.getPage(page_num))

output = PdfFileWriter()
# open() replaces the Python 2-only file(); raw strings keep the backslashes literal.
append_pdf(PdfFileReader(open(r"C:\sample.pdf", "rb")), output)
append_pdf(PdfFileReader(open(r"c:\sample1.pdf", "rb")), output)
append_pdf(PdfFileReader(open(r"c:\sample2.pdf", "rb")), output)
append_pdf(PdfFileReader(open(r"c:\sample3.pdf", "rb")), output)
with open(r"c:\combined.pdf", "wb") as combined:
    output.write(combined)
Answer by Paul Rooney
You can use PyPDF2's PdfMerger class.
File Concatenation
You can simply concatenate files by using the append method.
from PyPDF2 import PdfFileMerger
pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']
merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(pdf)
merger.write("result.pdf")
merger.close()
You can pass file handles instead of file paths if you want.
File Merging
If you want more fine-grained control of merging, there is a merge method of the PdfMerger, which allows you to specify an insertion point in the output file, meaning you can insert the pages anywhere in the file. The append method can be thought of as a merge where the insertion point is the end of the file.
e.g.
merger.merge(2, pdf)
Here we insert the whole pdf into the output but at page 2.
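The effect of an insertion point can be pictured with plain lists. The sketch below assumes the position behaves like a 0-based list index, which is worth checking against the library version in use; insert_pages is a hypothetical helper and the strings stand in for page objects.

```python
def insert_pages(output_pages, position, new_pages):
    """Return a copy of output_pages with new_pages inserted at position."""
    merged = list(output_pages)
    merged[position:position] = new_pages
    return merged

print(insert_pages(["a1", "a2", "a3"], 2, ["b1", "b2"]))
# ['a1', 'a2', 'b1', 'b2', 'a3']
print(insert_pages(["a1", "a2", "a3"], 3, ["b1"]))
# ['a1', 'a2', 'a3', 'b1']
```

The second call shows the append case: inserting at the end of the existing pages.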
Page Ranges
If you wish to control which pages are appended from a particular file, you can use the pages keyword argument of append and merge, passing a tuple in the form (start, stop[, step]) (like the regular range function).
e.g.
merger.append(pdf, pages=(0, 3)) # first 3 pages
merger.append(pdf, pages=(0, 6, 2)) # pages 1,3, 5
If you specify an invalid range you will get an IndexError.
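Since the pages tuple follows the same convention as range, the selection can be previewed with the standard library alone. A small sketch follows; the helper name selected_pages is made up for illustration, and it reports 1-based page numbers as a reader would count them.

```python
def selected_pages(pages):
    """Return the 1-based page numbers chosen by a (start, stop[, step]) tuple.

    Hypothetical helper; the tuple itself uses 0-based indices, like range().
    """
    return [index + 1 for index in range(*pages)]

print(selected_pages((0, 3)))     # [1, 2, 3]
print(selected_pages((0, 6, 2)))  # [1, 3, 5]
```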
Note also that, to avoid files being left open, the PdfFileMerger's close method should be called once the merged file has been written. This ensures all files (input and output) are closed in a timely manner. It's a shame that PdfFileMerger isn't implemented as a context manager; if it were, we could use the with keyword, avoid the explicit close call and get some easy exception safety.
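One way to get that with-statement behaviour anyway is contextlib.closing from the standard library, which calls close() on exit for any object that has such a method. The sketch below uses StandInMerger, a hypothetical placeholder, so that it runs without PyPDF2 installed; in real use a PdfFileMerger() instance would take its place.

```python
from contextlib import closing

class StandInMerger:
    """Hypothetical placeholder for a merger object that has close()."""

    def __init__(self):
        self.appended = []
        self.closed = False

    def append(self, pdf):
        self.appended.append(pdf)

    def close(self):
        self.closed = True

# closing() calls merger.close() when the block exits, even on an exception.
with closing(StandInMerger()) as merger:
    for pdf in ["file1.pdf", "file2.pdf"]:
        merger.append(pdf)

print(merger.closed)  # True
```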
You might also want to look at the pdfcat script provided as part of PyPDF2. You can potentially avoid the need to write code altogether.
The PyPDF2 GitHub repository also includes some example code demonstrating merging.
Answer by Patrick Maupin
The pdfrw library can do this quite easily, assuming you don't need to preserve bookmarks and annotations, and your PDFs aren't encrypted. cat.py is an example concatenation script, and subset.py is an example page subsetting script.
The relevant part of the concatenation script, assuming inputs is a list of input filenames and outfn is an output file name:
from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)
As you can see from this, it would be pretty easy to leave out the last page, e.g. something like:
writer.addpages(PdfReader(inpfn).pages[:-1])
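The slice [:-1] drops the final element of a sequence. For libraries that take a (start, stop) page range instead of a sliced list, the same idea might be expressed as below. This is a sketch: all_but_last is a hypothetical helper, the page count is assumed to be known already, and the strings stand in for page objects.

```python
def all_but_last(num_pages):
    """Return a (start, stop) tuple covering every page except the last one."""
    return (0, max(num_pages - 1, 0))

pages = ["p1", "p2", "p3", "blank"]  # stand-in for a list of page objects
start, stop = all_but_last(len(pages))
print(pages[start:stop])                # ['p1', 'p2', 'p3']
print(pages[:-1] == pages[start:stop])  # True
```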
Disclaimer: I am the primary pdfrw author.
Answer by Giovanni G. PY
Merge all PDF files that are present in a directory
Put the PDF files in a directory. Launch the program. You get one PDF with all the PDFs merged.
import os
from PyPDF2 import PdfFileMerger
x = [a for a in os.listdir() if a.endswith(".pdf")]
merger = PdfFileMerger()
for pdf in x:
    merger.append(open(pdf, 'rb'))
with open("result.pdf", "wb") as fout:
    merger.write(fout)
Answer by guruprasad mulay
from PyPDF2 import PdfFileMerger
import webbrowser
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
def list_files(directory, extension):
    return (f for f in os.listdir(directory) if f.endswith('.' + extension))
pdfs = list_files(dir_path, "pdf")
merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(open(pdf, 'rb'))
with open('result.pdf', 'wb') as fout:
    merger.write(fout)
webbrowser.open_new('file://'+ dir_path + '/result.pdf')
Git Repo: https://github.com/mahaguru24/Python_Merge_PDF.git
Answer by Ogaga Uzoh
A slight variation using a dictionary for greater flexibility (e.g. sort, dedup):
import os
from PyPDF2 import PdfFileMerger
# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
    for file in files:
        filepath = subdir + os.sep + file
        # you can have multiple endswith
        if filepath.endswith((".pdf", ".PDF")):
            file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)
for k, v in file_dict.items():
    print(k, v)
    merger.append(v)
merger.write("combined_result.pdf")
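The snippet mentions sorting but does not show it. One way to do it is to walk the dictionary in key order before appending. A short sketch with made-up file names follows; ordered_paths is a hypothetical helper.

```python
def ordered_paths(file_dict):
    """Return the stored paths ordered by file name, ignoring case."""
    return [file_dict[name] for name in sorted(file_dict, key=str.lower)]

# Keys are file names, so a repeated name overwrites the earlier path (dedup).
file_dict = {}
for path in ["reports/b.pdf", "reports/A.PDF", "archive/b.pdf"]:
    file_dict[path.rsplit("/", 1)[-1]] = path

print(ordered_paths(file_dict))  # ['reports/A.PDF', 'archive/b.pdf']
```

Each path returned this way could then be passed to merger.append in turn.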
Answer by user8291021
I used pdfunite from the Linux terminal via subprocess. This assumes one.pdf and two.pdf exist in the current directory; the aim is to merge them into three.pdf.
import subprocess
subprocess.call(['pdfunite', 'one.pdf', 'two.pdf', 'three.pdf'])
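When the file names come from variables, the command can be built as an argument list and the tool's presence checked first. This is a sketch, assuming the pdfunite tool (part of poppler-utils) is installed; pdfunite_command and unite are hypothetical helper names.

```python
import shutil
import subprocess

def pdfunite_command(inputs, output):
    """Build the argument list: pdfunite <input>... <output>."""
    return ["pdfunite", *inputs, output]

def unite(inputs, output):
    """Run pdfunite; raises if the tool is missing or exits with an error."""
    if shutil.which("pdfunite") is None:
        raise FileNotFoundError("pdfunite was not found on PATH")
    subprocess.run(pdfunite_command(inputs, output), check=True)

print(pdfunite_command(["one.pdf", "two.pdf"], "three.pdf"))
# ['pdfunite', 'one.pdf', 'two.pdf', 'three.pdf']
```

Passing a list without shell=True also means file names containing spaces need no extra quoting.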

