How to get the diff of two PDF files using Python?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/1310836/

Time: 2020-11-03 21:54:44  Source: igfitidea

python pdf

Asked by Goutham

I need to find the difference between two PDF files. Does anybody know of any Python-related tool which has a feature that directly gives the diff of the two PDFs?

Accepted answer by fbuchinger

What do you mean by "difference"? A difference in the text of the PDF, or some layout change (e.g. an embedded graphic was resized)? The first is easy to detect; the second is almost impossible to get (PDF is a very complicated file format that offers endless formatting capabilities).

If you want the text diff, run a PDF-to-text utility on the two PDFs and then use Python's built-in diff library (difflib) to get the difference between the converted texts.
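
As a rough sketch of this approach, assuming the pdftotext command from poppler-utils is installed and on the PATH. The function names pdf_to_text and text_diff are made up for illustration, and any other PDF-to-text converter could be swapped in:

```python
import difflib
import subprocess
import sys


def pdf_to_text(path):
    """Extract text from a PDF by calling the external pdftotext tool."""
    # "-" as the output file makes pdftotext write to stdout;
    # -layout asks it to keep the physical layout of the page.
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def text_diff(old_text, new_text, old_name="old", new_name="new"):
    """Return a unified diff of two texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python pdfdiff.py old.pdf new.pdf
    a, b = sys.argv[1], sys.argv[2]
    print("\n".join(text_diff(pdf_to_text(a), pdf_to_text(b), a, b)))
```

An empty result means the extracted texts are identical; it says nothing about layout or graphics.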

This question deals with PDF-to-text conversion in Python: "Python module for converting PDF to text".

The reliability of this method depends on the PDF generators you are using. If you use, for example, Adobe Acrobat and some Ghostscript-based PDF creator to make two PDFs from the same Word document, you might still get a diff even though the source document was identical.

This is because there are dozens of ways to encode the information of the source document to a PDF and each converter uses a different approach. Often the pdf to text converter can't figure out the correct text flow, especially with complex layouts or tables.

Answer by Anurag Uniyal

I do not know your use case, but for regression tests of a script that generates PDFs using reportlab, I diff PDFs by:

  1. Converting each page to an image using Ghostscript
  2. Diffing each page against the page image of the standard PDF, using PIL

For example:

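from PIL import Image, ImageChops

# imagePath1 and imagePath2 are the rendered page images to compare
im1 = Image.open(imagePath1)
im2 = Image.open(imagePath2)

# Pixel-wise absolute difference; an all-black result means the pages match
imDiff = ImageChops.difference(im1, im2)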

This works in my case for flagging any changes introduced due to code changes.
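
The two steps could be put together roughly as follows. This is only a sketch under assumptions: the Ghostscript executable is taken to be named gs, pages are rendered with the png16m device, Pillow is installed, and the helper names (ghostscript_command, render_pages, images_differ) are hypothetical rather than part of the original answer:

```python
import subprocess


def ghostscript_command(pdf_path, output_pattern, dpi=150):
    """Build a Ghostscript command that renders each page to a PNG file."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=png16m",                 # 24-bit RGB PNG output
        "-r%d" % dpi,                      # rendering resolution in DPI
        "-sOutputFile=" + output_pattern,  # e.g. "page-%03d.png"
        pdf_path,
    ]


def render_pages(pdf_path, output_pattern, dpi=150):
    """Run Ghostscript; raises CalledProcessError if rendering fails."""
    subprocess.run(ghostscript_command(pdf_path, output_pattern, dpi), check=True)


def images_differ(image_path_1, image_path_2):
    """Return True if the two page images differ in any pixel."""
    # Pillow is imported here so the command builder works without it.
    from PIL import Image, ImageChops

    with Image.open(image_path_1) as im1, Image.open(image_path_2) as im2:
        if im1.size != im2.size:
            return True
        diff = ImageChops.difference(im1.convert("RGB"), im2.convert("RGB"))
        # getbbox() returns None when the difference image is entirely black
        return diff.getbbox() is not None
```

Both PDFs need to be rendered at the same resolution, otherwise the page images will differ in size and always be reported as different.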

Answer by gzerone

I ran into the same problem in a unit test for encrypted PDFs; neither pdfminer nor pyPdf worked well for me.

Two commands, pdftocairo and pdftotext, worked perfectly in my test. (Ubuntu install: apt-get install poppler-utils)

You can get the PDF content with:

from subprocess import Popen, PIPE

def get_formatted_content(pdf_content):
    cmd = 'pdftocairo -pdf - -' # you can replace "pdftocairo -pdf" with "pdftotext" if you want to get diff info
    ps = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    stdout, stderr = ps.communicate(input=pdf_content)
    if ps.returncode != 0:
        raise OSError(ps.returncode, cmd, stderr)
    return stdout

It seems pdftocairo can redraw PDF files, and pdftotext can extract all the text.

Then you can compare the two PDF files:

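with open('f1.pdf', 'rb') as f1, open('f2.pdf', 'rb') as f2:  # read as bytes
    c1 = get_formatted_content(f1.read())
    c2 = get_formatted_content(f2.read())
print(c1 == c2)  # binary compare; cmp() was removed in Python 3
# For a text compare (with pdftotext), diff the decoded lines:
# import difflib
# t1, t2 = c1.decode('utf-8').splitlines(), c2.decode('utf-8').splitlines()
# print('\n'.join(difflib.unified_diff(t1, t2, lineterm='')))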

Answer by Victor Schröder

Even though this question is quite old, my guess is that I can contribute to the topic.

We have several applications generating tons of PDFs. One of these apps is written in Python and recently I wanted to write integration tests to check if the PDF generation was working correctly.

Testing PDF generation is HARD, because the specs for PDF files are very complicated and the output is non-deterministic. Two PDFs generated from exactly the same input data will be different files, so direct file comparison is ruled out.

The solution: we have to test how they look (because THAT should be deterministic!).

In our case, the PDFs are generated with the reportlab package, but this doesn't matter from the test perspective: we just need a filename or the PDF blob (bytes) from the generator. We also need an expectation file containing a "good" PDF to compare with the one coming from the generator.

The PDFs are converted to images and then compared. This can be done in multiple ways, but we decided to use ImageMagick, because it is extremely versatile and very mature, with bindings for almost every programming language out there. For Python 3, the bindings are offered by the Wand package.

The test looks something like the following. Specific details of our implementation were removed and the example was simplified:

import os
from unittest import TestCase
from wand.image import Image
from app.generators.pdf import PdfGenerator


DIR = os.path.dirname(__file__)


class PdfGeneratorTest(TestCase):

    def test_generated_pdf_should_match_expectation(self):
        # `pdf` is the blob of the generated PDF
        # If using reportlab, this is what you get calling `getpdfdata()`
        # on a Canvas instance, after all the drawing is complete
        pdf = PdfGenerator().generate()

        # PDFs are vectorial, so we need to set a resolution when
        # converting to an image
        actual_img = Image(blob=pdf, resolution=150)

        filename = os.path.join(DIR, 'expected.pdf')

        # Make sure to use the same resolution as above
        with Image(filename=filename, resolution=150) as expected:
            diff = actual_img.compare(expected, metric='root_mean_square')
            self.assertLess(diff[1], 0.01)

The 0.01 threshold is the tolerance we allow for small differences. Given that diff[1] ranges from 0 to 1 with the root_mean_square metric, we are accepting a difference of up to 1% across all channels compared with the expected sample file.
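
To get a feel for what a root-mean-square threshold of 0.01 allows, here is a small plain-Python illustration. It is not ImageMagick's or Wand's implementation: the function normalized_rmse and the scaling by a maximum channel value of 255 are assumptions made for this sketch.

```python
import math


def normalized_rmse(pixels_a, pixels_b, max_value=255):
    """Root mean square error between two equal-length sequences of
    channel values, scaled so 0.0 means identical and 1.0 means
    maximally different."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of values")
    squared = [((a - b) / max_value) ** 2 for a, b in zip(pixels_a, pixels_b)]
    return math.sqrt(sum(squared) / len(squared))


# A 100-value "page" that is entirely white...
expected = [255] * 100
# ...a copy where every value is slightly darker...
slightly_off = [253] * 100
# ...and a copy where a single value flipped to black.
one_black = [255] * 99 + [0]

print(normalized_rmse(expected, expected))      # 0.0
print(normalized_rmse(expected, slightly_off))  # about 0.0078, under 0.01
print(normalized_rmse(expected, one_black))     # about 0.1, over 0.01
```

Under these assumptions, a faint uniform shift stays below the threshold, while a full change in just 1% of the values already exceeds it, so the tolerance is stricter for localized changes than "1%" might suggest.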

Answer by mtasic85

Check this out, it can be useful: http://pybrary.net/pyPdf/