Merging PDF files with Python

Note: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/3444645/

Date: 2020-08-18 11:08:35  Source: igfitidea

Merge PDF files

python, pdf, file-io

Asked by Btibert3

Is it possible, using Python, to merge separate PDF files?

Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure.

And I may be pushing my luck, but is it possible to exclude a page that is contained in each of the PDFs (my report generation always creates an extra blank page)?

Accepted answer by Gilles 'SO- stop being evil'

Use pyPdf or its successor PyPDF2:

A Pure-Python library built as a PDF toolkit. It is capable of:
* splitting documents page by page,
* merging documents page by page,

(and much more)

Here's a sample program that works with both versions.

#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    # On Python 3, binary data must go to sys.stdout.buffer
    pdf_cat(sys.argv[1:], getattr(sys.stdout, 'buffer', sys.stdout))
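
The try/finally pattern above keeps every input file open until the output has been written. On Python 3.3 or later, contextlib.ExitStack is one possible way to express the same idea. The sketch below is only an illustration: open_all is a hypothetical helper, and throwaway files stand in for real PDFs so that it runs without any PDF library.

```python
import os
import tempfile
from contextlib import ExitStack

def open_all(paths, stack):
    """Open every path in binary mode; the stack closes them all on exit."""
    return [stack.enter_context(open(p, "rb")) for p in paths]

# Create two throwaway files standing in for PDFs
tmpdir = tempfile.mkdtemp()
paths = []
for name in ("a.bin", "b.bin"):
    path = os.path.join(tmpdir, name)
    with open(path, "wb") as f:
        f.write(name.encode())
    paths.append(path)

with ExitStack() as stack:
    streams = open_all(paths, stack)
    # The writer would be built and written here, while the inputs are open
    still_open = [not s.closed for s in streams]

# Leaving the with block closes every input, even after an exception
all_closed = all(s.closed for s in streams)
print(still_open, all_closed)
```

In pdf_cat, the writer.write(output_stream) call would sit inside the with block, where the comment is.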

Answer by Martin Thoma

Is it possible, using Python, to merge separate PDF files?

Yes.

The following example merges all files in one folder to a single new PDF file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()

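    for pdffile in glob(os.path.join(path, '*.pdf')):
        # Compare absolute paths so an existing output file is skipped
        if os.path.abspath(pdffile) == os.path.abspath(output_filename):
            continue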
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))

    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()

    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")

    args = parser.parse_args()
    merge(args.path, args.output_filename)

Answer by Mark K

This page gives a solution: http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/

Similarly:

from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input, output):
    # Add every page of the input reader to the output writer
    for page_num in range(input.numPages):
        output.addPage(input.getPage(page_num))

output = PdfFileWriter()

# Raw strings keep the backslashes in the Windows paths literal;
# open() replaces the Python 2-only file() builtin
append_pdf(PdfFileReader(open(r"C:\sample.pdf", "rb")), output)
append_pdf(PdfFileReader(open(r"c:\sample1.pdf", "rb")), output)
append_pdf(PdfFileReader(open(r"c:\sample2.pdf", "rb")), output)
append_pdf(PdfFileReader(open(r"c:\sample3.pdf", "rb")), output)

with open(r"c:\combined.pdf", "wb") as f:
    output.write(f)

Answer by Paul Rooney

You can use PyPDF2's PdfFileMerger class.

File Concatenation

You can simply concatenate files by using the append method.

from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
merger.close()

You can pass file handles instead of file paths if you want.

File Merging

If you want more fine-grained control of merging, there is a merge method of the PdfFileMerger, which allows you to specify an insertion point in the output file, meaning you can insert the pages anywhere in the file. The append method can be thought of as a merge where the insertion point is the end of the file.

e.g.

merger.merge(2, pdf)

Here we insert the whole pdf into the output but at page 2.
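
The insertion-point idea can be pictured with plain lists standing in for pages. This is only an analogy, not PyPDF2's implementation: merge_at and append_pages are hypothetical helpers, and the position is assumed to be a zero-based index into the output.

```python
def merge_at(output_pages, position, new_pages):
    """Insert new_pages into output_pages starting at index position."""
    return output_pages[:position] + new_pages + output_pages[position:]

def append_pages(output_pages, new_pages):
    """Appending is the same as merging at the end of the output."""
    return merge_at(output_pages, len(output_pages), new_pages)

existing = ["A1", "A2", "A3", "A4"]
incoming = ["B1", "B2"]

print(merge_at(existing, 2, incoming))    # ['A1', 'A2', 'B1', 'B2', 'A3', 'A4']
print(append_pages(existing, incoming))   # ['A1', 'A2', 'A3', 'A4', 'B1', 'B2']
```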

Page Ranges

If you wish to control which pages are appended from a particular file, you can use the pages keyword argument of append and merge, passing a tuple in the form (start, stop[, step]) (like the regular range function).

e.g.

merger.append(pdf, pages=(0, 3))    # first 3 pages
merger.append(pdf, pages=(0, 6, 2)) # pages 1,3, 5

If you specify an invalid range you will get an IndexError.
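
To check which zero-based page indices a given pages tuple would select, the built-in range can serve as a stand-in, since the tuple follows the same (start, stop[, step]) convention. The helper below, selected_pages, is hypothetical and only mimics the behaviour described above; it does not call PyPDF2.

```python
def selected_pages(pages, num_pages):
    """Return the zero-based page indices picked by a (start, stop[, step]) tuple."""
    indices = list(range(*pages))
    # Mimic the IndexError described above for ranges outside the document
    if any(i < 0 or i >= num_pages for i in indices):
        raise IndexError("page range %r does not fit %d pages" % (pages, num_pages))
    return indices

print(selected_pages((0, 3), 10))      # [0, 1, 2]
print(selected_pages((0, 6, 2), 10))   # [0, 2, 4]
```

Indices 0, 2 and 4 correspond to pages 1, 3 and 5 when counting from one.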

Note also that, to avoid files being left open, the PdfFileMerger's close method should be called when the merged file has been written. This ensures all files (input and output) are closed in a timely manner. It's a shame that PdfFileMerger isn't implemented as a context manager, which would let us use the with keyword, avoid the explicit close call, and get some easy exception safety.
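
One possible workaround is contextlib.closing from the standard library, which wraps any object that has a close method so it can be used with the with keyword. The sketch below uses a hypothetical stand-in class, FakeMerger, so that it runs without PyPDF2; the assumption is that a real PdfFileMerger() could be wrapped the same way.

```python
from contextlib import closing

class FakeMerger:
    """Hypothetical stand-in with the same append/write/close shape."""

    def __init__(self):
        self.appended = []
        self.closed = False

    def append(self, pdf):
        self.appended.append(pdf)

    def write(self, target):
        # The stand-in writes nothing; it only rejects an empty merge
        if not self.appended:
            raise ValueError("nothing to write")

    def close(self):
        self.closed = True

# closing() calls merger.close() on exit, even if an exception is raised
with closing(FakeMerger()) as merger:
    for pdf in ["file1.pdf", "file2.pdf"]:
        merger.append(pdf)
    merger.write("result.pdf")

print(merger.closed)
```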

You might also want to look at the pdfcat script provided as part of PyPDF2. You can potentially avoid the need to write code altogether.

The PyPDF2 GitHub repository also includes some example code demonstrating merging.

Answer by Patrick Maupin

The pdfrw library can do this quite easily, assuming you don't need to preserve bookmarks and annotations, and your PDFs aren't encrypted. cat.py is an example concatenation script, and subset.py is an example page subsetting script.

The relevant part of the concatenation script, assuming inputs is a list of input filenames and outfn is an output file name:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)

As you can see from this, it would be pretty easy to leave out the last page, e.g. something like:

    writer.addpages(PdfReader(inpfn).pages[:-1])

Disclaimer: I am the primary pdfrw author.

Answer by Giovanni G. PY

Merge all pdf files that are present in a dir

Put the pdf files in a dir. Launch the program. You get one pdf with all the pdfs merged.

import os
from PyPDF2 import PdfFileMerger

x = [a for a in os.listdir() if a.endswith(".pdf")]

merger = PdfFileMerger()

for pdf in x:
    merger.append(open(pdf, 'rb'))

with open("result.pdf", "wb") as fout:
    merger.write(fout)

Answer by guruprasad mulay

from PyPDF2 import PdfFileMerger
import webbrowser
import os
dir_path = os.path.dirname(os.path.realpath(__file__))

def list_files(directory, extension):
    return (f for f in os.listdir(directory) if f.endswith('.' + extension))

pdfs = list_files(dir_path, "pdf")

merger = PdfFileMerger()

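for pdf in pdfs:
    # Join with dir_path so this works from any current directory
    merger.append(open(os.path.join(dir_path, pdf), 'rb'))

# Write next to the script, where webbrowser looks for the result
with open(os.path.join(dir_path, 'result.pdf'), 'wb') as fout:
    merger.write(fout)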

webbrowser.open_new('file://'+ dir_path + '/result.pdf')

Git Repo: https://github.com/mahaguru24/Python_Merge_PDF.git

Answer by Ogaga Uzoh

A slight variation using a dictionary for greater flexibility (e.g. sort, dedup):

import os
from PyPDF2 import PdfFileMerger
# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
    for file in files:
        filepath = subdir + os.sep + file
        # you can have multiple endswith
        if filepath.endswith((".pdf", ".PDF")):
            file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)

for k, v in file_dict.items():
    print(k, v)
    merger.append(v)

merger.write("combined_result.pdf")
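
The question also asks about repeating the merge for each folder in a directory. A minimal sketch of the collection step, using only the standard library, might group the PDF paths by folder first. pdfs_by_folder is a hypothetical helper; the merging itself is left out so the sketch runs without any PDF library.

```python
import os

def pdfs_by_folder(root):
    """Map each folder under root to a sorted list of the PDF paths inside it."""
    groups = {}
    for subdir, dirs, files in os.walk(root):
        pdfs = sorted(
            os.path.join(subdir, name)
            for name in files
            if name.lower().endswith(".pdf")
        )
        # Folders without any PDF are left out of the result
        if pdfs:
            groups[subdir] = pdfs
    return groups

if __name__ == "__main__":
    for folder, paths in sorted(pdfs_by_folder(".").items()):
        print(folder, len(paths))
```

Each list could then be passed, path by path, to merger.append, with one merger and one output file per folder.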

Answer by user8291021

I used pdfunite on the Linux terminal via subprocess. This assumes one.pdf and two.pdf exist in the current directory; the aim is to merge them into three.pdf.

import subprocess
# Pass the arguments as a list so no shell is needed
subprocess.call(['pdfunite', 'one.pdf', 'two.pdf', 'three.pdf'])
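
For more than two inputs, the command can be built from a list of file names. The sketch below assumes pdfunite is installed and takes its inputs first and the output file last, as in the call above; build_pdfunite_cmd and run_pdfunite are hypothetical helper names.

```python
import shutil
import subprocess

def build_pdfunite_cmd(inputs, output):
    """Return the argument list: pdfunite, every input, then the output."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["pdfunite"] + list(inputs) + [output]

def run_pdfunite(inputs, output):
    """Run pdfunite, failing early if the tool is not on the PATH."""
    if shutil.which("pdfunite") is None:
        raise RuntimeError("pdfunite was not found on the PATH")
    # check_call raises CalledProcessError on a non-zero exit status
    subprocess.check_call(build_pdfunite_cmd(inputs, output))

print(build_pdfunite_cmd(["one.pdf", "two.pdf"], "three.pdf"))
```

Passing a list avoids shell quoting problems with file names that contain spaces.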