Python pypdf: merge multiple PDF files into one PDF

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/17104926/

Time: 2020-08-19 00:28:11  Source: igfitidea

pypdf Merging multiple pdf files into one pdf

python pypdf

Asked by daydaysay

Suppose I have 1000+ PDF files that need to be merged into one PDF:

output = PdfFileWriter()
# filenames holds the paths filename0000 ... filename1000
for filename in filenames:
    # Python 2: file() is the built-in that open() replaces in Python 3
    input = PdfFileReader(file(filename, "rb"))
    pageCount = input.getNumPages()
    for iPage in range(0, pageCount):
        output.addPage(input.getPage(iPage))
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()

When I run the code above, it fails after roughly 500 files, at the line input = PdfFileReader(file(filename, "rb")), with this error message:

IOError: [Errno 24] Too many open files:

I think this is a bug. If not, what should I do?

Accepted answer by Rejected

I recently came across this exact same problem, so I dug into PyPDF2 to see what's going on, and how to resolve it.

Note: I am assuming that filename is a well-formed file path string. Assume the same for all of my code.

The Short Answer

Use the PdfFileMerger() class instead of the PdfFileWriter() class. I've tried to make the following resemble your code as closely as I could:

from PyPDF2 import PdfFileMerger, PdfFileReader

[...]

merger = PdfFileMerger()
for filename in filenames:
    merger.append(PdfFileReader(file(filename, 'rb')))

merger.write("document-output.pdf")

The Long Answer

The way you're using PdfFileReader and PdfFileWriter keeps each file open, and eventually causes Python to raise IOError 24. To be more specific, when you add a page to the PdfFileWriter, you are adding references to the page in the open PdfFileReader (hence the noted IO error if you close the file). Python sees that the file is still referenced and doesn't do any garbage collection or automatic file closing, even though the variable holding the file handle is reused. The files remain open until PdfFileWriter no longer needs access to them, which is at output.write(outputStream) in your code.

To solve this, create copies of the content in memory, and allow the file to be closed. I noticed in my adventures through the PyPDF2 code that the PdfFileMerger() class already has this functionality, so instead of reinventing the wheel, I opted to use it. I learned, though, that my initial look at PdfFileMerger wasn't close enough, and that it only creates copies under certain conditions.

My initial attempts looked like the following, and were resulting in the same IO Problems:

merger = PdfFileMerger()
for filename in filenames:
    merger.append(filename)

merger.write(output_file_path)

Looking at the PyPDF2 source code, we see that append() requires fileobj to be passed, and then uses the merge() function, passing in its last page as the new file's position. merge() does the following with fileobj (before opening it with PdfFileReader(fileobj)):

    if type(fileobj) in (str, unicode):
        fileobj = file(fileobj, 'rb')
        my_file = True
    elif type(fileobj) == file:
        fileobj.seek(0)
        filecontent = fileobj.read()
        fileobj = StringIO(filecontent)
        my_file = True
    elif type(fileobj) == PdfFileReader:
        orig_tell = fileobj.stream.tell()   
        fileobj.stream.seek(0)
        filecontent = StringIO(fileobj.stream.read())
        fileobj.stream.seek(orig_tell)
        fileobj = filecontent
        my_file = True

We can see that append() does accept a string, and when given one, assumes it's a file path and creates a file object at that location. The end result is the exact thing we're trying to avoid: a PdfFileReader() object holding a file open until the output is eventually written!

However, if we make either a file object or a PdfFileReader object (see Edit 2) from the file path string before it gets passed into append(), it will automatically create a copy for us as a StringIO object, allowing Python to close the file.

I would recommend the simpler merger.append(file(filename, 'rb')), as others have reported that a PdfFileReader object may stay open in memory, even after calling writer.close().
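For readers on Python 3, where the file() built-in no longer exists, the same recommendation might be sketched as follows. This is only a sketch under stated assumptions: it assumes a PyPDF2 release that still provides PdfFileMerger, and the helper name merge_pdfs and its parameters are hypothetical, not part of PyPDF2. It relies on the behavior shown in the merge() excerpt above, where a file object passed to append() is copied into memory.

```python
def merge_pdfs(paths, output_path):
    """Merge the PDFs at `paths`, in order, into `output_path`.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    # Imported lazily so this sketch can be loaded without PyPDF2 installed.
    # Assumption: a PyPDF2 release that still ships PdfFileMerger.
    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger()
    try:
        for path in paths:
            # Pass an open file object rather than the path string, so the
            # merger copies the bytes and the handle is closed when this
            # `with` block ends.
            with open(path, "rb") as f:
                merger.append(f)
        merger.write(output_path)
    finally:
        merger.close()
```

A call such as merge_pdfs(filenames, "document-output.pdf") would then keep at most one input file open at a time, at the cost of holding a copy of every input in memory until the output is written.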

Hope this helped!

EDIT: I assumed you were using PyPDF2, not PyPDF. If you aren't, I highly recommend switching, as PyPDF is no longer maintained, and its author has given his official blessing to Phaseit to develop PyPDF2.

If for some reason you cannot switch to PyPDF2 (licensing, system restrictions, etc.), then PdfFileMerger won't be available to you. In that situation you can reuse the code from PyPDF2's merge function (provided above) to create a copy of the file as a StringIO object, and use that in your code in place of the file object.
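The in-memory copy described here can be sketched with the standard library alone. The sketch below uses io.BytesIO, the Python 3 counterpart of the StringIO used in the PyPDF2 excerpt; the helper name load_into_memory is hypothetical and chosen only for illustration.

```python
import io


def load_into_memory(path):
    """Return a seekable in-memory copy of the file at `path`.

    Hypothetical helper for illustration: the real file is closed as soon
    as its bytes have been read, so it no longer counts against the
    open-file limit.
    """
    with open(path, "rb") as f:
        data = f.read()
    # BytesIO offers read() and seek(), like an open binary file.
    return io.BytesIO(data)
```

The returned object could then be passed wherever the open file was used before, for example PdfFileReader(load_into_memory(filename)). The trade-off is memory: every input stays in RAM until the output has been written.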

EDIT 2: The previous recommendation of using merger.append(PdfFileReader(file(filename, 'rb'))) was changed based on comments (thanks, @Agostino).

Answer by flyingfoxlee

It may be just what it says: you are opening too many files. You can explicitly use f = file(filename) ... f.close() in the loop, or use the with statement, so that each opened file is properly closed.
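As a minimal illustration of the with statement in a loop, the sketch below reads the first few bytes of each file and lets with close the handle at the end of every iteration, so only one file is open at a time. The helper name read_headers is hypothetical. Note that, per the accepted answer, PdfFileWriter still needs the source files open when it writes, so this pattern only helps when everything needed from the file is read inside the block.

```python
def read_headers(paths, size=5):
    """Return the first `size` bytes of each file in `paths`.

    Hypothetical helper for illustration only.
    """
    headers = []
    for path in paths:
        # The file is closed automatically when the block exits,
        # even if read() raises an exception.
        with open(path, "rb") as f:
            headers.append(f.read(size))
    return headers
```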

Answer by sgillis

The problem is that you are only allowed to have a certain number of files open at any given time. There are ways to change this (http://docs.python.org/3/library/resource.html#resource.getrlimit), but I don't think you need this.

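To see the limit in question, the resource module linked above can be queried directly. This is a small sketch for Unix-like systems only (the resource module is not available on Windows); the function name open_file_limits is hypothetical.

```python
import resource  # Unix-only standard library module


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print("soft limit:", soft, "hard limit:", hard)
```

The soft limit is the one the process actually runs into. A hard limit equal to resource.RLIM_INFINITY means no upper bound is configured.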

What you could try is closing the files in the for loop:

output = PdfFileWriter()
for filename in filenames:
    f = open(filename, 'rb')
    input = PdfFileReader(f)
    # Some code
    f.close()

Answer by Patrick Maupin

The pdfrw package reads each file all in one go, so it will not suffer from the problem of too many open files. Here is an example concatenation script.

The relevant part, assuming inputs is a list of input file names and outfn is an output file name:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)

Disclaimer: I am the primary pdfrw author.

Answer by Souravi Sinha

I have written this code to help with the answer:

import sys
import os
import PyPDF2

merger = PyPDF2.PdfFileMerger()

# Get the working directory and the PDF file names from the command line
path = sys.argv[1]
pdfs = sys.argv[2:]
os.chdir(path)

# Iterate over the documents
for pdf in pdfs:
    # Only merge documents that exist
    if not os.path.exists(pdf):
        print(f"problem with file {pdf}")
        continue
    try:
        # The merger copies the reader's stream, so the file can be
        # closed as soon as append() returns
        with open(pdf, 'rb') as f:
            merger.append(PyPDF2.PdfFileReader(f))
    except Exception:
        print("cant merge !! sorry")
    else:
        print(f" {pdf} Merged !!! ")

merger.write("Merged_doc.pdf")

Here I have used PyPDF2.PdfFileMerger and PyPDF2.PdfFileReader, instead of explicitly converting the file name to a file object.