How to unlock a "secured" (read-protected) PDF in Python?

Disclaimer: This page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/28192977/

Time: 2020-08-19 02:53:07  Source: igfitidea

How to unlock a "secured" (read-protected) PDF in Python?

python pdf pdfminer pdf-scraping

Asked by kramer65

In Python I'm using pdfminer to read the text from a PDF with the code at the end of this message. I now get an error message saying:

File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF with Acrobat Pro, it turns out it is secured (or "read-protected"). From this link, however, I read that there are many services that can disable this read protection easily (for example pdfunlock.com). When diving into the source of pdfminer, I see that the error above is generated on these lines.

if check_extractable and not doc.is_extractable:
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)

Since there are many services that can disable this read protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I don't think it is as simple as changing .is_extractable to True.

Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!


================================================


Below you will find the code with which I currently extract the text from PDFs that are not read-protected.

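# Python 2 imports for pdfminer (the traceback above shows Python 2.7 and cStringIO)
from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def getTextFromPDF(rawFile):
    # Set up the text converter that writes extracted text into outfp
    resourceManager = PDFResourceManager(caching=True)
    outfp = StringIO()
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
    interpreter = PDFPageInterpreter(resourceManager, device)

    # Wrap the raw PDF bytes in a file-like object for pdfminer
    fileData = StringIO()
    fileData.write(rawFile)
    # check_extractable=True is what raises PDFTextExtractionNotAllowed
    for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fileData.close()
    device.close()

    result = outfp.getvalue()

    outfp.close()
    return result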

Answered by Jaza

As far as I know, in most cases the full content of the PDF is actually encrypted, using the password as the encryption key, and so simply setting .is_extractable to True isn't going to help you.

Per this thread:

Does a library exist to remove passwords from PDFs programmatically?

I would recommend removing the read protection with a command-line tool such as qpdf (easily installable, e.g. on Ubuntu use apt-get install qpdf if you don't have it already):

qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf
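
If you would rather call qpdf from inside a Python program, the command above can be wrapped with subprocess. This is only a minimal sketch: the function names build_qpdf_command and decrypt_with_qpdf are made up for illustration, and it assumes the qpdf executable is on your PATH. It treats exit status 3 as success, since qpdf uses that status when it produced output but printed warnings.

```python
import subprocess


def build_qpdf_command(src, dst, password=""):
    """Return the argument list for the qpdf call shown above (hypothetical helper)."""
    return ["qpdf", "--password=" + password, "--decrypt", src, dst]


def decrypt_with_qpdf(src, dst, password=""):
    """Run qpdf to write a decrypted copy of src to dst (assumes qpdf is installed)."""
    result = subprocess.run(build_qpdf_command(src, dst, password))
    # qpdf exits with 0 on success and 3 when it succeeded with warnings
    if result.returncode not in (0, 3):
        raise RuntimeError("qpdf failed with exit status %d" % result.returncode)
    return dst

# Example call (not executed here):
# decrypt_with_qpdf("SECURED.pdf", "UNSECURED.pdf", password="PASSWORD")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.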

Then open the unlocked file with pdfminer and do your stuff.

For a pure-Python solution, you can try using PyPDF2 and its .decrypt() method, but it doesn't work with all types of encryption, so really, you're better off just using qpdf. See:

https://github.com/mstamy2/PyPDF2/issues/53


Answered by jtlz2

In my case there was no password, but simply setting check_extractable=False circumvented the PDFTextExtractionNotAllowed exception for a problematic file (that opened fine in other viewers).
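
For reference, here is a rough Python 3 sketch of that change, assuming the pdfminer.six fork rather than the Python 2 pdfminer used in the question. The function name extract_text_ignoring_flag is made up for illustration, and depending on the pdfminer version the default value of check_extractable may differ, so it is passed explicitly.

```python
from io import BytesIO, StringIO


def extract_text_ignoring_flag(raw_bytes):
    """Extract text from PDF bytes without honoring the 'no extraction' flag."""
    # Imported here so this module still loads where pdfminer.six is not installed
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    out = StringIO()
    device = TextConverter(manager, out, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, device)
    # check_extractable=False skips the check that raises PDFTextExtractionNotAllowed
    for page in PDFPage.get_pages(BytesIO(raw_bytes), check_extractable=False):
        interpreter.process_page(page)
    device.close()
    return out.getvalue()
```

This only helps when the file is not password-encrypted, as in the case described above; it does not decrypt anything.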

Answered by IanJ

I had some issues trying to get qpdf to behave in my program. I found a useful library, pikepdf, that is based on qpdf and automatically converts pdfs to be extractable.


The code to use this is pretty straightforward:


import pikepdf

pdf = pikepdf.open('unextractable.pdf')
pdf.save('extractable.pdf')

Answered by AlfyFaisy

The 'check_extractable=True' argument is by design. Some PDFs explicitly disallow text extraction, and PDFMiner follows the directive. You can override it (by passing check_extractable=False), but do so at your own risk.

Answered by Knoweldgeyog

I too faced the same problem of parsing a secured PDF, but it was resolved using the pikepdf library. I tried this library in my Jupyter notebook and on Windows, where it gave errors, but it worked smoothly on Ubuntu.

Answered by komutohirowato

If you want to unlock all pdf files in a folder without renaming them, you may use this code:


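import glob, os, pikepdf

p = os.getcwd()
for file in glob.glob('*.pdf'):
    # Build an absolute path with forward slashes (matters on Windows)
    file_path = os.path.join(p, file).replace('\\', '/')
    init_pdf = pikepdf.open(file_path)
    # Copy the pages into a fresh, unencrypted PDF
    new_pdf = pikepdf.new()
    new_pdf.pages.extend(init_pdf.pages)
    # Save under the original file name
    new_pdf.save(str(file))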

In the pikepdf library it is impossible to overwrite an opened file by saving it under the same name. Instead, copy the pages to a newly created empty PDF and save that.