How to get the difference between two PDF files using Python?

Question

Asked by Goutham

I need to find the difference between two PDF files. Does anybody know of any Python-related tool which has a feature that directly gives the diff of the two PDFs?

Answer 1

Accepted answer by fbuchinger

What do you mean by "difference"? A difference in the text of the PDF, or a layout change (e.g. an embedded graphic was resized)? The first is easy to detect; the second is almost impossible to get, because PDF is a VERY complicated file format that offers endless formatting capabilities.

If you want the text diff, just run a PDF-to-text utility on the two PDFs and then use Python's built-in diff library (difflib) to get the difference between the converted texts.

This question deals with pdf to text conversion in python: Python module for converting PDF to text.
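
The text-diff approach described above can be sketched roughly as follows. This is a minimal sketch, not the answerer's code: it assumes the pdftotext command-line tool (from poppler-utils) is installed and on the PATH, and the helper names pdf_to_text and diff_texts are made up for illustration.

```python
import difflib
import subprocess


def pdf_to_text(path):
    """Extract text with the pdftotext CLI (assumed to be installed).

    The trailing "-" tells pdftotext to write the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def diff_texts(text_a, text_b, name_a="a.pdf", name_b="b.pdf"):
    """Return a unified diff of two extracted texts as a list of lines."""
    return list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical file names; replace with your own PDFs.
    for line in diff_texts(pdf_to_text("a.pdf"), pdf_to_text("b.pdf")):
        print(line)
```

An empty list means the extracted texts are identical, which does not necessarily mean the PDFs look the same.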

The reliability of this method depends on the PDF generators you are using. If you use, e.g., Adobe Acrobat and some Ghostscript-based PDF creator to make two PDFs from the SAME Word document, you might still get a diff even though the source document was identical.

This is because there are dozens of ways to encode the information of the source document to a PDF and each converter uses a different approach. Often the pdf to text converter can't figure out the correct text flow, especially with complex layouts or tables.

Answer 2

Answer by Anurag Uniyal

I do not know your use case, but for regression tests of a script that generates PDFs using reportlab, I diff the PDFs by:

Converting each page to an image using Ghostscript
Diffing each page against the page image of the standard PDF, using PIL

For example:

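from PIL import Image, ImageChops

# imagePath1 and imagePath2 are the two rendered page images to compare
im1 = Image.open(imagePath1)
im2 = Image.open(imagePath2)

# Per-pixel absolute difference; an all-black result means the pages match
imDiff = ImageChops.difference(im1, im2)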

This works in my case for flagging any changes introduced due to code changes.
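
The two steps above might be wired together roughly like this. It is only a sketch under assumptions: the gs executable is assumed to be on the PATH, Pillow is assumed to be installed, and the helper names render_pages and changed_region are hypothetical, not part of the answer.

```python
import glob
import os
import subprocess

from PIL import Image, ImageChops


def render_pages(pdf_path, out_dir, dpi=150):
    """Render each page of a PDF to a PNG with Ghostscript (assumed installed)."""
    pattern = os.path.join(out_dir, "page-%03d.png")
    subprocess.run(
        ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m",
         f"-r{dpi}", f"-sOutputFile={pattern}", pdf_path],
        check=True,
    )
    return sorted(glob.glob(os.path.join(out_dir, "page-*.png")))


def changed_region(im1, im2):
    """Return the bounding box of differing pixels, or None if the images match."""
    # Convert to RGB so the comparison does not depend on mode or alpha handling.
    a, b = im1.convert("RGB"), im2.convert("RGB")
    if a.size != b.size:
        # Different page sizes: report the whole area as changed.
        return (0, 0, max(a.width, b.width), max(a.height, b.height))
    return ImageChops.difference(a, b).getbbox()
```

A test could then loop over the page images of the generated and the standard PDF and fail as soon as changed_region returns something other than None.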

Answer 3

Answer by gzerone

I ran into the same problem in a unit test for an encrypted PDF; neither pdfminer nor pyPdf worked well for me.

Here are two commands (pdftocairo, pdftotext) that worked perfectly in my test. (Ubuntu install: apt-get install poppler-utils)

You can get pdf content by:

from subprocess import Popen, PIPE

def get_formatted_content(pdf_content):
    cmd = 'pdftocairo -pdf - -' # you can replace "pdftocairo -pdf" with "pdftotext" if you want to get diff info
    ps = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    stdout, stderr = ps.communicate(input=pdf_content)
    if ps.returncode != 0:
        raise OSError(ps.returncode, cmd, stderr)
    return stdout

It seems pdftocairo can redraw PDF files, and pdftotext can extract all the text.

And then you can compare two pdf files:

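with open('f1.pdf', 'rb') as f1, open('f2.pdf', 'rb') as f2:  # PDFs must be read as bytes
    c1 = get_formatted_content(f1.read())
    c2 = get_formatted_content(f2.read())
print(c1 == c2)  # binary compare (cmp() no longer exists in Python 3)
# For a text compare (when using pdftotext), diff the decoded lines:
# import difflib
# print('\n'.join(difflib.unified_diff(c1.decode().splitlines(), c2.decode().splitlines(), lineterm='')))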

Answer 4

Answer by Victor Schröder

Even though this question is quite old, my guess is that I can contribute to the topic.

We have several applications generating tons of PDFs. One of these apps is written in Python and recently I wanted to write integration tests to check if the PDF generation was working correctly.

Testing PDF generation is HARD, because the specs for PDF files are very complicated and the output is non-deterministic. Two PDFs generated with exactly the same input data will be different files, so direct file comparison is ruled out.

The solution: we have to test the way they look (because THAT should be deterministic!).

In our case, the PDFs are generated with the reportlab package, but this doesn't matter from the test's perspective: we just need a filename or the PDF blob (bytes) from the generator. We also need an expectation file containing a "good" PDF to compare with the one coming from the generator.

The PDFs are converted to images and then compared. This can be done in multiple ways, but we decided to use ImageMagick, because it is extremely versatile and very mature, with bindings for almost every programming language out there. For Python 3, the bindings are offered by the Wand package.

The test looks something like the following. Specific details of our implementation were removed and the example was simplified:

import os
from unittest import TestCase
from wand.image import Image
from app.generators.pdf import PdfGenerator


DIR = os.path.dirname(__file__)


class PdfGeneratorTest(TestCase):

    def test_generated_pdf_should_match_expectation(self):
        # `pdf` is the blob of the generated PDF
        # If using reportlab, this is what you get calling `getpdfdata()`
        # on a Canvas instance, after all the drawing is complete
        pdf = PdfGenerator().generate()

        # PDFs are vectorial, so we need to set a resolution when
        # converting to an image
        actual_img = Image(blob=pdf, resolution=150)

        filename = os.path.join(DIR, 'expected.pdf')

        # Make sure to use the same resolution as above
        with Image(filename=filename, resolution=150) as expected:
            diff = actual_img.compare(expected, metric='root_mean_square')
            self.assertLess(diff[1], 0.01)

The 0.01 threshold is as low as we can go while still tolerating small differences. Considering that diff[1] varies from 0 to 1 using the root_mean_square metric, we are accepting a difference of up to 1% on all channels compared with the expected sample file.
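
To get a feel for what a threshold on a 0-to-1 root-mean-square value means, here is a small pure-Python illustration of a normalized RMS difference. It is a sketch of the general idea only, not ImageMagick's or Wand's actual implementation; the function name normalized_rmse and the flat lists of 8-bit sample values are assumptions made for the example.

```python
import math


def normalized_rmse(pixels_a, pixels_b, max_value=255):
    """Root-mean-square difference of two equal-length sample lists, scaled to 0..1."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("both images must have the same number of samples")
    if not pixels_a:
        return 0.0
    # Scale each difference to 0..1 before squaring, then average and take the root.
    total = sum(((a - b) / max_value) ** 2 for a, b in zip(pixels_a, pixels_b))
    return math.sqrt(total / len(pixels_a))


if __name__ == "__main__":
    white = [255] * 10000
    one_flipped = [0] + [255] * 9999
    print(normalized_rmse(white, white))        # identical samples
    print(normalized_rmse(white, one_flipped))  # one fully inverted sample
```

Under this simplified model, a single fully inverted sample out of 10,000 already sits right at 0.01, so the threshold tolerates only very small rendering differences.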

Answer 5

Answer by mtasic85

Check this out, it can be useful: http://pybrary.net/pyPdf/
