Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/559096/

Date: 2020-11-03 20:22:15  Source: igfitidea

Check whether a PDF-File is valid (Python)

Tags: python, file, pdf

Asked by theomega

I get a file via an HTTP upload and need to be sure it's a PDF file. The programming language is Python, but this should not matter.

I thought of the following solutions:

  1. Check if the first bytes of the string are "%PDF". This is not a good check, but it prevents the user from uploading other files accidentally.

  2. Try libmagic (the "file" command in bash uses it). This does exactly the same check as 1.

  3. Take a library and try to read the page count out of the file. If the library is able to read a page count, it should be a valid PDF. Problem: I don't know a library for Python that can do this.
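Option 1 can be sketched in a few lines. This is only a minimal illustration, not a real validity check: the helper name looks_like_pdf is hypothetical, and it confirms nothing more than that the upload begins with the "%PDF" marker.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data begins with the "%PDF" marker.

    This only catches accidental uploads of other file types; a file can
    start with these bytes and still be corrupt.
    """
    return data.startswith(b"%PDF")


# Only the first few bytes of the upload are needed for this check.
print(looks_like_pdf(b"%PDF-1.4\n..."))  # True
print(looks_like_pdf(b"hello world!"))   # False
```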

So does anybody have a suggestion for a library or another trick?

Thanks

Accepted answer by Van Gale

The two most commonly used PDF libraries for Python are pyPdf and ReportLab.

Both are pure Python, so they should be easy to install as well as cross-platform.

With pyPdf it would probably be as simple as doing:

from pyPdf import PdfFileReader

# Constructing the reader parses the file, so a non-PDF raises an exception here.
doc = PdfFileReader(open("upload.pdf", "rb"))

This should be enough, but doc will now have getDocumentInfo() and getNumPages() methods (also exposed as the documentInfo and numPages properties) if you want to do further checking.

As Carl answered, pdftotext is also a good solution, and would probably be faster on very large documents (especially ones with many cross-references). However, it might be a little slower on small PDFs due to the system overhead of forking a new process, etc.

Answer by WoJ

Since apparently neither PyPdf nor ReportLab is available anymore, the current solution I found (as of 2015) is to use PyPDF2 and catch exceptions (and possibly analyze getDocumentInfo()).

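import PyPDF2

# Create a file that is deliberately not a PDF.
with open("testfile.txt", "w") as f:
    f.write("hello world!")

try:
    # Parsing happens in the constructor; a non-PDF raises PdfReadError.
    with open("testfile.txt", "rb") as f:
        PyPDF2.PdfFileReader(f)
# This is the exception path as of 2015; later PyPDF2 releases may expose it elsewhere.
except PyPDF2.utils.PdfReadError:
    print("invalid PDF file")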

Answer by MrTopf

In a project of mine I need to check the MIME type of some uploaded file. I simply use the file command like this:

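from subprocess import Popen, PIPE

# `file` is the uploaded file object; only its first 1024 bytes are sniffed.
# The trailing "-" tells the file command to read from stdin.
proc = Popen(["/usr/bin/file", "-b", "--mime", "-"], stdout=PIPE, stdin=PIPE)
filetype = proc.communicate(file.read(1024))[0].strip()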

Of course, you might want to move the actual command into a configuration file, since the command-line options vary among operating systems (e.g. Mac).

If you just need to know whether it's a PDF or not and do not need to process it anyway, I think the file command is a faster solution than a library. Doing it by hand is of course also possible, but the file command may give you more flexibility if you want to check for different types.
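A rough sketch of how the result of the file command could be turned into a yes/no answer is shown here. It assumes a file executable is on the PATH and that it accepts the same -b --mime options used above; the function names sniff_mime and is_pdf_upload are made up for illustration.

```python
import subprocess


def sniff_mime(data: bytes, file_cmd: str = "file") -> str:
    """Ask the file command for the MIME description of the first bytes."""
    result = subprocess.run(
        [file_cmd, "-b", "--mime", "-"],  # "-" means: read from stdin
        input=data[:1024],                # the first 1024 bytes are enough
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode().strip()


def is_pdf_upload(data: bytes) -> bool:
    """Return True if the file command reports a PDF MIME type."""
    # --mime output typically also carries a charset, so compare the prefix.
    return sniff_mime(data).startswith("application/pdf")
```

As noted in the question, this is still essentially a header check, so it shares the limits of testing for "%PDF" by hand.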

Answer by Cal Jacobson

If you're on a Linux or OS X box, you could use pdftotext (part of Xpdf). If you pass a non-PDF to pdftotext, it will certainly bark at you, and you can use commands.getstatusoutput (subprocess.getstatusoutput in Python 3) to get the output and parse it for these warnings.
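The idea might look roughly like the sketch below. Note that it swaps in a simpler signal than parsing the warnings: it relies on pdftotext returning a non-zero exit status when it cannot open the file as a PDF. It assumes a pdftotext executable is on the PATH, and the function name pdftotext_accepts is hypothetical.

```python
import subprocess


def pdftotext_accepts(path: str, cmd: str = "pdftotext") -> bool:
    """Return True if pdftotext exits with status 0 for the given file."""
    result = subprocess.run(
        [cmd, path, "-"],           # "-" sends the extracted text to stdout
        stdout=subprocess.DEVNULL,  # the text itself is not needed here
        stderr=subprocess.PIPE,     # warnings and errors are reported here
    )
    return result.returncode == 0
```

If the warnings themselves matter, result.stderr holds them and can be inspected in the same call.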

If you're looking for a platform-independent solution, you might be able to make use of pyPdf.

Edit: It's not elegant, but it looks like pyPdf's PdfFileReader will throw an IOError(22) if you attempt to load a non-PDF.

Answer by Steve Claridge

By "valid", do you mean that it can be displayed by a PDF viewer, or that the text can be extracted? They are two very different things.

If you just want to check that it really is a PDF file that has been uploaded then the pyPDF solution, or something similar, will work.

If, however, you want to check that the text can be extracted then you have found a whole world of pain! Using pdftotext would be a simple solution that would work in a majority of cases but it is by no means 100% successful. We have found many examples of PDFs that pdftotext cannot extract from but Java libraries such as iText and PDFBox can.

Answer by Maged Saeed

I ran into the same problem, but was not forced to use a programming language for this task. I tried pyPDF, but it was not efficient for me, as it hangs indefinitely on some corrupted files.

However, I have found this software useful so far.

Good luck with it.

https://sourceforge.net/projects/corruptedpdfinder/
