Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/6469157/

Time: 2020-08-05 04:43:09  Source: igfitidea

PDF compare on linux command line

linux pdf comparison ghostscript

Asked by Christof Aenderl

I'm looking for a Linux command line tool to compare two PDF files and save the diffs to a PDF outfile. The tool should create diff PDFs in a batch process. The PDF files are construction plans, so a pure text comparison doesn't work.

Something like:

<tool> file1.pdf file2.pdf -o diff-out.pdf

Most of the tools I found convert the PDFs to images and compare them, but only with a GUI.

Any other solution is also welcome.

Accepted answer by Christof Aenderl

Done in 2 lines with (the almighty) ImageMagick and pdftk:

compare -verbose -debug coder "$PDF_1" "$PDF_2" -compose src "$OUT_FILE.tmp"
pdftk "$OUT_FILE.tmp" background "$PDF_1" output "$OUT_FILE"

The options -verbose and -debug are optional.

  • compare creates a PDF with the differences marked as red pixels.
  • pdftk merges that diff PDF with PDF_1 as the background.
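For the batch processing the question asks for, the two commands can be wrapped in a small shell function. This is only a minimal sketch and not part of the original answer: the function name diff_pdf and the temporary file name are hypothetical, and it assumes that compare exits with status 1 when the inputs merely differ and with a higher status on a real error.

```shell
#!/bin/sh
# diff_pdf OLD.pdf NEW.pdf OUT.pdf
# Hypothetical wrapper around the compare + pdftk pair shown above.
diff_pdf() {
    old=$1
    new=$2
    out=$3
    tmp="${out%.pdf}.difflayer.pdf"   # hypothetical name for the red diff layer

    # Assumption: status 0 = similar, 1 = different, anything above = error.
    status=0
    compare "$old" "$new" -compose src "$tmp" || status=$?
    if [ "$status" -gt 1 ]; then
        echo "compare failed for $old vs $new" >&2
        return "$status"
    fi

    # Put the diff layer on top of the first PDF, then remove the layer file.
    rc=0
    pdftk "$tmp" background "$old" output "$out" || rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

A batch run could then be a loop such as: for f in old/*.pdf; do diff_pdf "$f" "new/${f##*/}" "diff/${f##*/}"; done, where the old/, new/ and diff/ directories are placeholders.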

Answer by xevincent

Here is a hack to do it.

pdftotext file1.pdf
pdftotext file2.pdf
diff file1.txt file2.txt
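The same hack can be written so that no .txt files are left next to the PDFs. A minimal sketch, assuming Poppler's pdftotext with its -layout option; the function name pdf_text_diff is hypothetical. Keep in mind that this only compares extracted text, so changes in drawings (such as construction plans) go unnoticed.

```shell
#!/bin/sh
# pdf_text_diff A.pdf B.pdf
# Prints a unified diff of the extracted text.
# Exit status: 0 = same text, 1 = text differs, 2 = extraction failed.
pdf_text_diff() {
    tmpdir=$(mktemp -d) || return 2

    # -layout keeps the physical layout of the page in the text output.
    if ! pdftotext -layout "$1" "$tmpdir/a.txt" ||
       ! pdftotext -layout "$2" "$tmpdir/b.txt"; then
        rm -rf "$tmpdir"
        return 2
    fi

    rc=0
    diff -u "$tmpdir/a.txt" "$tmpdir/b.txt" || rc=$?
    rm -rf "$tmpdir"
    return "$rc"
}
```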

Answer by Kurt Pfeifle

I've written my own script that does something similar to what you're asking for. The script uses 4 tools to achieve its goal:

  1. ImageMagick's compare command
  2. the pdftk utility (if you have multipage PDFs)
  3. Ghostscript (optional)
  4. md5sum (optional)

It should be quite easy to port this to a .bat batch file for DOS/Windows.

But first, please note: this only works well for PDFs which have the same page/media size. The comparison is done pixel by pixel between the two input PDFs. The resulting file is an image showing the "diff" like this:

  • Each pixel that remains unchanged becomes white.
  • Each pixel that got changed is painted in red.
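Because the pixel-by-pixel comparison only works for equal page sizes, a batch run may want to check the sizes first. A minimal sketch, assuming ImageMagick's identify is installed; the function name same_page_size is hypothetical, and only the first page of each file is checked.

```shell
#!/bin/sh
# same_page_size A.pdf B.pdf
# Succeeds when the first page of both PDFs renders to the same pixel size.
same_page_size() {
    # "[0]" selects the first page; %w and %h are width and height in pixels.
    size_a=$(identify -format '%wx%h' "$1[0]") || return 2
    size_b=$(identify -format '%wx%h' "$2[0]") || return 2
    [ "$size_a" = "$size_b" ]
}
```

Pairs that fail this check could be skipped or logged instead of being passed to compare.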

That diff image is saved as a new PDF to make it more accessible on different OS platforms.

I'm using this, for example, to discover minimal page display differences when font substitution in PDF processing comes into play.

It can happen that there is no visible difference between your PDFs even though they differ in MD5 hash and/or file size. In this case the "diff" output PDF page is all white. You can detect this condition automatically and delete the all-white diff PDFs, so that you only have to visually inspect the non-white ones.

Here are the building blocks:

pdftk

Use this command line utility to split multipage PDF files into multiple single-page PDFs:

pdftk  file_1.pdf  burst  output  somewhere/file_1---page_%03d.pdf
pdftk  file_2.pdf  burst  output  somewhere/file_2---page_%03d.pdf

If you are comparing 1-page PDFs only, this building block is optional. Since you talk about "construction plans", this is likely the case.
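For multipage input, it can help to burst each file into its own directory and compare the page counts before diffing. A minimal sketch; the function name burst_pdf and the page_%03d.pdf naming are hypothetical choices, not something the script above prescribes.

```shell
#!/bin/sh
# burst_pdf FILE.pdf OUTDIR
# Splits FILE.pdf into OUTDIR/page_001.pdf, page_002.pdf, ... and prints
# the number of pages written.
burst_pdf() {
    mkdir -p "$2" || return 2
    pdftk "$1" burst output "$2/page_%03d.pdf" || return 2

    # Count the page files; an unmatched glob stays literal, hence the -e test.
    set -- "$2"/page_*.pdf
    if [ ! -e "$1" ]; then
        echo 0
        return 0
    fi
    echo "$#"
}
```

With pages_1=$(burst_pdf file_1.pdf somewhere/file_1) and pages_2=$(burst_pdf file_2.pdf somewhere/file_2), a test like [ "$pages_1" -eq "$pages_2" ] catches PDFs whose page counts differ.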

compare

Use this command line utility from ImageMagick to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder \
       -log "%u %m:%l %e" \
        somewhere/file_1---page_001.pdf \
        somewhere/file_2---page_001.pdf \
       -compose src \
        somewhereelse/file_1--file_2---diff_page_001.pdf
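To run this for every page, the command can go into a loop. A minimal sketch, assuming the pages of both PDFs were split into two directories with matching file names (page_001.pdf, page_002.pdf, ...); the function name diff_pages and that directory layout are hypothetical. It also assumes that compare returns status 1 when the pages merely differ.

```shell
#!/bin/sh
# diff_pages DIR_1 DIR_2 OUTDIR
# For every DIR_1/page_NNN.pdf that has a counterpart in DIR_2, writes
# OUTDIR/diff_page_NNN.pdf.
diff_pages() {
    mkdir -p "$3" || return 2
    for page in "$1"/page_*.pdf; do
        [ -e "$page" ] || continue        # the glob matched nothing
        name=${page##*/}                  # e.g. page_001.pdf
        if [ ! -e "$2/$name" ]; then
            echo "no counterpart for $name" >&2
            continue
        fi
        rc=0
        compare "$page" "$2/$name" -compose src "$3/diff_$name" || rc=$?
        # Assumption: status 1 only means "pages differ"; above that is an error.
        [ "$rc" -le 1 ] || echo "compare failed on $name" >&2
    done
}
```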

Ghostscript

Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5 hash-based file comparisons.

If you want to automatically discover all cases where the diff PDF consists of a purely white page, you should convert the PDF page to a metadata-free bitmap format using a Ghostscript output device such as bmp256 or ppmraw (the commands here use ppmraw). You can do that like this:

First, find out what the page size format of your PDF is. Again, this little utility, identify, comes as part of any ImageMagick installation:

 identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf

You can store this value in an environment variable like this:

 export my_size=$(identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf)

Now Ghostscript comes into play, using a command line which includes the page size discovered above, as stored in the variable:

 gs \
   -o somewhereelse/file_1--file_2---diff_page_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
    somewhereelse/file_1--file_2---diff_page_001.pdf

This gives you a PPM (Portable PixMap) with a resolution of 72 dpi from the original PDF page. 72 dpi usually is good enough for what we want... Next, create a purely white PPM page with the same page size:

 gs \
   -o somewhereelse/file_1--file_2---whitepage_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
   -c "showpage"

The -c "showpage" part is a PostScript command that tells Ghostscript to emit only an empty page.
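For batch use, the identify call and the two Ghostscript calls can be bundled. A minimal sketch that uses only the options shown above; the function name render_ppms and the output file names are hypothetical, and the diff PDF is assumed to have a single page.

```shell
#!/bin/sh
# render_ppms DIFF.pdf
# Writes DIFF.ppm (the rendered diff page) and DIFF.white.ppm (an empty page
# of the same pixel size) next to DIFF.pdf.
render_ppms() {
    base=${1%.pdf}

    # Pixel size of the diff page, e.g. 595x842.
    size=$(identify -format "%[fx:(w)]x%[fx:(h)]" "$1") || return 2

    # Render the diff page, then an empty page with identical settings.
    gs -o "$base.ppm" -sDEVICE=ppmraw -r72 -g"$size" "$1" || return 2
    gs -o "$base.white.ppm" -sDEVICE=ppmraw -r72 -g"$size" -c "showpage" || return 2
}
```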

MD5 sum

Use the MD5 hash to automatically compare the original PPM with the whitepage PPM. If they are the same, you can safely assume that there are no differences between the PDFs and therefore rename or delete the diff PDF:

 # md5sum prints "<hash>  <file>"; awk keeps only the hash
 MD5_1=$(md5sum somewhereelse/file_1--file_2---diff_page_001.ppm | awk '{print $1}')
 MD5_2=$(md5sum somewhereelse/file_1--file_2---whitepage_001.ppm | awk '{print $1}')

 if [ "x${MD5_1}" = "x${MD5_2}" ]; then
     mv  \
       somewhereelse/file_1--file_2---diff_page_001.pdf \
       somewhereelse/file_1--file_2---NODIFFERENCE_page_001.pdf # rename all-white PDF
     rm  \
       somewhereelse/file_1--file_2---*page_001.ppm             # delete both PPMs
 fi

This spares you from having to visually inspect "diff PDFs" that do not have any differences.
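Since both PPM files sit on the same machine, a byte-for-byte comparison with cmp -s gives the same answer as comparing MD5 hashes and skips the hashing step. A minimal sketch, not part of the original script; the function name is_blank_diff is hypothetical, and both PPMs are assumed to have been rendered with identical Ghostscript settings.

```shell
#!/bin/sh
# is_blank_diff DIFF.ppm WHITE.ppm
# Succeeds when the rendered diff page is byte-identical to the white page.
is_blank_diff() {
    # -s: print nothing, report the result through the exit status only.
    cmp -s "$1" "$2"
}
```

It would be used as the condition of the if statement, for example: if is_blank_diff diff.ppm white.ppm; then ...; fi, with the file names as placeholders.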
