Original source: http://stackoverflow.com/questions/6469157/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
PDF compare on linux command line
Asked by Christof Aenderl
I'm looking for a Linux command line tool to compare two PDF files and save the differences to a PDF output file. The tool should create diff PDFs in a batch process. The PDF files are construction plans, so a pure text comparison doesn't work.
Something like:
<tool> file1.pdf file2.pdf -o diff-out.pdf
Most of the tools I found convert the PDFs to images and compare them, but only with a GUI.
Any other solution is also welcome.
Accepted answer by Christof Aenderl
Done in 2 lines with (the almighty) ImageMagick and pdftk:
compare -verbose -debug coder $PDF_1 $PDF_2 -compose src $OUT_FILE.tmp
pdftk $OUT_FILE.tmp background $PDF_1 output $OUT_FILE
The options -verbose and -debug are optional.
- compare creates a PDF with the diff as red pixels.
- pdftk merges the diff-pdf with background PDF_1
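Since the question asks for batch processing, the two commands above could be wrapped in a small shell function and called in a loop. This is only a minimal sketch and not part of the original answer: the function name `pdf_diff` and the `old/`, `new/` and `diff/` directories in the usage comment are assumptions for illustration.

```shell
# pdf_diff OLD NEW OUT: write a diff PDF of OLD and NEW to OUT.
# Hypothetical helper; it only wraps the two commands from the answer.
pdf_diff() {
    pd_old=$1
    pd_new=$2
    pd_out=$3
    pd_tmp="${pd_out}.tmp"

    # Step 1: render the differences as red pixels into a temporary file.
    # compare may exit with status 1 when the inputs differ, so only a
    # status above 1 is treated as a failure here.
    pd_status=0
    compare "$pd_old" "$pd_new" -compose src "$pd_tmp" || pd_status=$?
    if [ "$pd_status" -gt 1 ]; then
        echo "compare failed for $pd_old and $pd_new" >&2
        return 1
    fi

    # Step 2: put the first PDF behind the diff so the changes show in context.
    pd_status=0
    pdftk "$pd_tmp" background "$pd_old" output "$pd_out" || pd_status=$?
    rm -f "$pd_tmp"
    return "$pd_status"
}

# Example batch run (assumed directory layout):
# for f in old/*.pdf; do
#     pdf_diff "$f" "new/$(basename "$f")" "diff/$(basename "$f")"
# done
```

The variables are quoted so that file names containing spaces survive the loop.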
Answered by xevincent
Here is a hack to do it.
pdftotext file1.pdf
pdftotext file2.pdf
diff file1.txt file2.txt
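The same idea can be written so that no `.txt` files are left next to the PDFs. This is a sketch under assumptions: `pdf_text_diff` is a hypothetical name, and it relies on `pdftotext` accepting an explicit output file as its second argument. As the question points out, a text comparison will not catch purely graphical changes such as those in construction plans.

```shell
# pdf_text_diff A.pdf B.pdf: diff the extracted text of two PDFs.
# Returns diff's status: 0 = same text, 1 = text differs, 2 = error.
pdf_text_diff() {
    td_dir=$(mktemp -d) || return 2

    # Extract the text of both PDFs into a private temporary directory.
    if ! pdftotext "$1" "$td_dir/a.txt" || ! pdftotext "$2" "$td_dir/b.txt"; then
        rm -rf "$td_dir"
        return 2
    fi

    td_status=0
    diff "$td_dir/a.txt" "$td_dir/b.txt" || td_status=$?
    rm -rf "$td_dir"
    return "$td_status"
}

# Example: pdf_text_diff file1.pdf file2.pdf
```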
Answered by Kurt Pfeifle
I've written my own script that does something similar to what you're asking for. The script uses 4 tools to achieve its goal:
- ImageMagick's compare command
- the pdftk utility (if you have multipage PDFs)
- Ghostscript (optional)
- md5sum (optional)
It should be quite easy to port this to a .bat batch file for DOS/Windows.
But first, please note: this only works well for PDFs that have the same page/media size. The comparison is done pixel by pixel between the two input PDFs. The resulting file is an image showing the "diff" like this:
- Each pixel that remains unchanged becomes white.
- Each pixel that got changed is painted in red.
That diff image is saved as a new PDF so that it can be viewed easily on different OS platforms.
I use this, for example, to detect minimal differences in page display when font substitution comes into play during PDF processing.
It can happen that there is no visible difference between your PDFs even though they differ in MD5 hash and/or file size. In this case the "diff" output PDF page is all white. You can detect this condition automatically and delete the all-white diff PDFs, so that you only have to visually inspect the non-white ones.
Here are the building blocks:
pdftk
Use this command line utility to split multipage PDF files into multiple singlepage PDFs:
pdftk file_1.pdf burst output somewhere/file_1---page_%03d.pdf
pdftk file_2.pdf burst output somewhere/file_2---page_%03d.pdf
If you are comparing 1-page PDFs only, this building block is optional. Since you talk about "construction plans", this is likely the case.
compare
Use this command line utility from ImageMagick to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder \
-log "%u %m:%l %e" \
somewhere/file_1---page_001.pdf \
somewhere/file_2---page_001.pdf \
-compose src \
somewhereelse/file_1--file_2---diff_page_001.pdf
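For multipage inputs, the compare step can be repeated for every page produced by the burst step. The loop below is a sketch, assuming the `file_1---page_NNN.pdf` / `file_2---page_NNN.pdf` naming used above; the function name `diff_burst_pages` is hypothetical, and the logging options are left out for brevity.

```shell
# diff_burst_pages SRC_DIR OUT_DIR: compare each burst page of file_1
# with the same-numbered page of file_2 and write one diff PDF per page.
diff_burst_pages() {
    bp_src=$1
    bp_out=$2
    for bp_a in "$bp_src"/file_1---page_*.pdf; do
        [ -e "$bp_a" ] || continue            # the glob matched nothing
        bp_page=${bp_a##*---page_}            # e.g. "001.pdf"
        bp_b="$bp_src/file_2---page_$bp_page"
        if [ ! -e "$bp_b" ]; then
            echo "no counterpart for $bp_a" >&2
            continue
        fi
        # A status of 1 may just mean "pages differ"; stop only on errors.
        bp_status=0
        compare "$bp_a" "$bp_b" -compose src \
            "$bp_out/file_1--file_2---diff_page_$bp_page" || bp_status=$?
        [ "$bp_status" -le 1 ] || return "$bp_status"
    done
    return 0
}

# Example: diff_burst_pages somewhere somewhereelse
```

Pages that exist in only one of the two documents are reported and skipped rather than compared.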
Ghostscript
Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5-hash-based file comparisons.
If you want to automatically discover all cases where the diff PDF consists of a purely white page, you should convert the PDF page to a metadata-free bitmap format using a Ghostscript output device such as bmp256 or ppmraw (the commands here use ppmraw). You can do that like this:
First, find out what the page size of your PDF is. Again, the little utility identify comes as part of any ImageMagick installation:
identify \
-format "%[fx:(w)]x%[fx:(h)]" \
somewhereelse/file_1--file_2---diff_page_001.pdf
You can store this value in an environment variable like this:
export my_size=$(identify \
-format "%[fx:(w)]x%[fx:(h)]" \
somewhereelse/file_1--file_2---diff_page_001.pdf)
Now Ghostscript comes into play, using a command line that includes the page size discovered above, as stored in the variable:
gs \
-o somewhereelse/file_1--file_2---diff_page_001.ppm \
-sDEVICE=ppmraw \
-r72 \
-g${my_size} \
somewhereelse/file_1--file_2---diff_page_001.pdf
This gives you a PPM (Portable PixMap) of the diff PDF page at a resolution of 72 dpi, which is usually good enough for this purpose. Next, create a purely white PPM page with the same page size:
gs \
-o somewhereelse/file_1--file_2---whitepage_001.ppm \
-sDEVICE=ppmraw \
-r72 \
-g${my_size} \
-c "showpage"
The -c "showpage" part is a PostScript command that tells Ghostscript to emit only an empty page.
MD5 sum
Use the MD5 hash to automatically compare the PPM of the diff page with the white-page PPM. If they are the same, you can safely assume that there are no differences between the PDFs, and you can therefore rename or delete the diff PDF:
MD5_1=$(md5sum somewhereelse/file_1--file_2---diff_page_001.ppm | awk '{print $1}')
MD5_2=$(md5sum somewhereelse/file_1--file_2---whitepage_001.ppm | awk '{print $1}')
if [ "x${MD5_1}" = "x${MD5_2}" ]; then
    mv \
        somewhereelse/file_1--file_2---diff_page_001.pdf \
        somewhereelse/file_1--file_2---NODIFFERENCE_page_001.pdf # rename all-white PDF
    rm \
        somewhereelse/file_1--file_2---*_page_001.ppm # delete both PPMs
fi
This spares you from having to visually inspect "diff PDFs" that do not have any differences.
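The Ghostscript and MD5 steps above can be combined into one check per diff page. The following is a sketch only, assuming single-page diff PDFs; `is_all_white` is a hypothetical name, and the temporary `.ppm` files are written next to the diff PDF.

```shell
# is_all_white DIFF.pdf: succeed (status 0) if the single-page diff PDF
# renders identically to an empty white page of the same size.
is_all_white() {
    aw_pdf=$1
    aw_base=${aw_pdf%.pdf}

    # Page size as WIDTHxHEIGHT, as reported by ImageMagick's identify.
    aw_size=$(identify -format "%[fx:(w)]x%[fx:(h)]" "$aw_pdf") || return 2

    # Render the diff page and an empty page to metadata-free PPM files.
    gs -q -o "$aw_base.ppm" -sDEVICE=ppmraw -r72 -g"$aw_size" "$aw_pdf" || return 2
    gs -q -o "$aw_base.white.ppm" -sDEVICE=ppmraw -r72 -g"$aw_size" -c "showpage" || return 2

    # Hash via stdin so the file name is not part of md5sum's output.
    aw_sum1=$(md5sum < "$aw_base.ppm")
    aw_sum2=$(md5sum < "$aw_base.white.ppm")
    rm -f "$aw_base.ppm" "$aw_base.white.ppm"

    [ "$aw_sum1" = "$aw_sum2" ]
}

# Example: delete every diff page that shows no difference.
# for f in somewhereelse/*---diff_page_*.pdf; do
#     if is_all_white "$f"; then rm "$f"; fi
# done
```

Reading the PPM from standard input makes the `awk` step unnecessary, because `md5sum` then prints the same placeholder name for both files.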