Comparing two PDF files using Java (approaches)

Question

Asked by Alvin

I need to write a Java class that compares two PDF files and points out the differences (in text, position, and font) using some sort of highlighting. My initial approach was to parse the files with PDFBox and store the extracted text in a data structure that would help with the comparison. Is there a Java library that can extract the text, preserve the formatting, and help with indexing and comparing? Can I use Tika or Google's diff-match for this? Tika extracts text as XHTML, but how can I compare two XHTML files?

Answer 1

Answered by Sajal Dutta

As you mentioned, use PDFBox to extract the contents of each file and then use Google's diff to compare them.
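
The comparison step can be sketched without any third-party library. The example below is only a minimal illustration, not Google's diff: it swaps in a plain longest-common-subsequence (LCS) line diff, and it assumes the text of both PDFs has already been extracted into lists of lines (for instance with PDFBox's PDFTextStripper, which is not shown). The class name TextDiff and the sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Line-based diff of text already extracted from two PDFs (extraction itself is not shown). */
class TextDiff {

    /**
     * Returns a diff script: "  " marks a line common to both texts,
     * "- " a line only in the first text, "+ " a line only in the second.
     */
    static List<String> diff(List<String> a, List<String> b) {
        int n = a.size(), m = b.size();
        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = a.get(i).equals(b.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk both lists, following the table to keep as many common lines as possible.
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (a.get(i).equals(b.get(j))) {
                out.add("  " + a.get(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + a.get(i++));
            } else {
                out.add("+ " + b.get(j++));
            }
        }
        while (i < n) out.add("- " + a.get(i++));
        while (j < m) out.add("+ " + b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines standing in for the output of a PDF text extractor.
        List<String> expected = List.of("Invoice 42", "Total: 100 EUR", "Thank you");
        List<String> actual = List.of("Invoice 42", "Total: 120 EUR", "Thank you");
        diff(expected, actual).forEach(System.out::println);
    }
}
```

A line diff like this reports only text changes. Differences in position or font would need extra data from the extractor, which plain text extraction discards.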

Answer 2

Answered by n002213f

Check this post on comparing PDF documents. Take note of this passage:

PDF is a flexible file format in which you can do things in many different ways. So you could create 2 different PDF versions of a file using Acrobat and Ghostscript (as an example). The files would (hopefully) be identical. But the files would be different sizes and the internal structure of each would be very different.

Answer 3

Answered by vins

I had to compare tons of PDF files in my project. My requirement was to compare the PDF files pixel by pixel. After a lot of googling I could not find anything good, so I ended up creating my own PDF utility for this purpose.

Please check this blog post for more details and the jar download:

http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/
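
For readers who want to see what a pixel-by-pixel comparison involves, here is a minimal sketch. It is not the utility from the blog post above. It assumes each PDF page has already been rendered to a BufferedImage (for example with PDFBox's PDFRenderer, not shown), and the class name PixelDiff, the Result holder, and the red highlight colour are all hypothetical choices.

```java
import java.awt.image.BufferedImage;

/** Pixel-by-pixel comparison of two page images (rendering the PDF pages is not shown). */
class PixelDiff {

    static final int HIGHLIGHT_ARGB = 0xFFFF0000; // opaque red, an arbitrary choice

    /** Holds the highlighted image and the number of pixels that differ. */
    static final class Result {
        final BufferedImage image;
        final int differingPixels;

        Result(BufferedImage image, int differingPixels) {
            this.image = image;
            this.differingPixels = differingPixels;
        }
    }

    /**
     * Builds an image large enough to cover both inputs. Pixels that match are
     * copied from `actual`; pixels that differ, or that exist in only one of
     * the two images, are painted in the highlight colour and counted.
     */
    static Result compare(BufferedImage expected, BufferedImage actual) {
        int w = Math.max(expected.getWidth(), actual.getWidth());
        int h = Math.max(expected.getHeight(), actual.getHeight());
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        int differing = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean inBoth = x < expected.getWidth() && y < expected.getHeight()
                        && x < actual.getWidth() && y < actual.getHeight();
                if (inBoth && expected.getRGB(x, y) == actual.getRGB(x, y)) {
                    out.setRGB(x, y, actual.getRGB(x, y));
                } else {
                    out.setRGB(x, y, HIGHLIGHT_ARGB);
                    differing++;
                }
            }
        }
        return new Result(out, differing);
    }

    public static void main(String[] args) {
        // Two tiny hypothetical "pages"; real input would be rendered PDF pages.
        BufferedImage expected = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage actual = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        actual.setRGB(2, 1, 0xFFFFFF); // one changed pixel
        Result r = compare(expected, actual);
        System.out.println("Differing pixels: " + r.differingPixels);
    }
}
```

An exact comparison like this flags every anti-aliasing difference, so both documents should be rendered with the same renderer and resolution. A per-channel tolerance may be needed in practice.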

Answer 4

Answered by Raju Penumatsa

I don't know if you were able to solve your problem. Here is my approach to solve this.

First, convert the PDFs to HTML using Pdf2dom, then use daisydiff to generate a comparison report in HTML. If you want a PDF, convert that HTML report to PDF. Keep in mind that PDF-to-HTML conversion is not 100% accurate because of the complexities of PDF. Another approach is to convert the PDFs to images, compare them pixel by pixel, and generate a PDF report. You can also try the PDFcompare library; it looks promising to me. Let me know if anyone has already tried it.
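
To give a rough idea of how two converted documents could be compared for text, position, and font, here is a minimal sketch using only the JDK's XML parser. It does not use Pdf2dom or daisydiff. It assumes the converter output is well-formed XML in which every text run is a leaf element whose style attribute carries its position and font; that layout, the class name XhtmlCompare, and the sample markup are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Naive comparison of text runs in two XHTML documents (assumed layout, see above). */
class XhtmlCompare {

    /** One text run: its text and its raw style attribute (position, font, ...). */
    static final class Run {
        final String text;
        final String style;

        Run(String text, String style) {
            this.text = text;
            this.style = style;
        }
    }

    /** Collects every leaf element with non-empty text, in document order. */
    static List<Run> runs(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList all = doc.getElementsByTagName("*");
            List<Run> out = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                boolean leaf = e.getElementsByTagName("*").getLength() == 0;
                String text = e.getTextContent().trim();
                if (leaf && !text.isEmpty()) {
                    out.add(new Run(text, e.getAttribute("style")));
                }
            }
            return out;
        } catch (Exception ex) {
            throw new IllegalArgumentException("input is not well-formed XML", ex);
        }
    }

    /** Compares runs index by index and reports text and style changes. */
    static List<String> compare(String xhtmlA, String xhtmlB) {
        List<Run> a = runs(xhtmlA), b = runs(xhtmlB);
        List<String> report = new ArrayList<>();
        int common = Math.min(a.size(), b.size());
        for (int i = 0; i < common; i++) {
            Run x = a.get(i), y = b.get(i);
            if (!x.text.equals(y.text)) {
                report.add("run " + i + ": text '" + x.text + "' -> '" + y.text + "'");
            }
            if (!x.style.equals(y.style)) {
                report.add("run " + i + ": style '" + x.style + "' -> '" + y.style + "'");
            }
        }
        if (a.size() != b.size()) {
            report.add("run count: " + a.size() + " -> " + b.size());
        }
        return report;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for converter output.
        String a = "<html><body><div style=\"top:10pt;left:5pt\">Hello</div></body></html>";
        String b = "<html><body><div style=\"top:12pt;left:5pt\">Hello</div></body></html>";
        compare(a, b).forEach(System.out::println);
    }
}
```

Matching runs by index breaks down as soon as a run is inserted or deleted, so a real tool would first align the two run lists with a diff algorithm. Input that carries a DOCTYPE declaration may also cause the default parser to try to load the external DTD.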

Answer 5

Answered by Tarun Kumar Nayak

See the sample code below for PDF comparison.

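// Arguments, judging by the file names: expected PDF, actual PDF, then the two diff output PDFs.
// Backslashes in Java string literals must be escaped.
ZPDFCompare obj = new ZPDFCompare();
obj.pdfcompare("C:\\Users\\Desktop\\expectedFile.pdf", "C:\\Users\\Desktop\\actualFile.pdf", "C:\\Users\\Desktop\\expectedFile_Diff.pdf", "C:\\Users\\tarun.kumar\\Desktop\\actualFile_Diff.pdf");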

Zeonpad provides a free Java API for PDF comparison.
