Is there a PDF parser for PHP?
Original question: http://stackoverflow.com/questions/1251956/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by elviejo79
Hi, I know about several PDF generators for PHP (fpdf, dompdf, etc.). What I want to know about is a parser.
For reasons beyond my control, certain information I need is only in a table inside a pdf and I need to extract that table and convert it to an array.
Any suggestions?
Answered by ircmaxell
I've written one before (for similar needs), and I can say this: have fun. It's quite a complex task. The PDF specification is large and unwieldy. There are several methods of storing text inside of it. And the kicker is that each PDF generator is different in how it works. So while something like TFPDF or DOMPDF creates REALLY easy to read PDFs (from a machine standpoint), Acrobat makes some really hellish documents.
The reason is how it writes the text. Most DOM based renderers --that I've used-- write the entire line as one string, and position it once (which is really easy to read). Acrobat tries to be more efficient (and it is) by writing only one or maybe a few characters at a time, and positioning them independently. While this REALLY simplifies rendering, it makes reading MUCH more difficult.
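To make the difference concrete, here is a minimal sketch in PHP. It assumes an already decompressed content stream that uses only literal strings with the Tj and TJ text-showing operators; the function name extractShownText is made up for illustration. It ignores escaped parentheses, hex strings, and font encodings, so treat it as a rough illustration rather than a working extractor.

```php
<?php
// Hypothetical helper: concatenate the literal strings shown by Tj / TJ operators.
// Assumes no escaped parentheses, no hex strings, and no ']' inside strings.
function extractShownText(string $content): string
{
    $text = '';
    // An operand is either "(string)" or "[ ... ]", followed by Tj or TJ.
    preg_match_all('/(\([^()]*\)|\[[^\]]*\])\s*T[jJ]/', $content, $ops);
    foreach ($ops[1] as $operand) {
        // Pull every "(...)" piece out of the operand and glue them together.
        preg_match_all('/\(([^()]*)\)/', $operand, $strings);
        $text .= implode('', $strings[1]);
    }
    return $text;
}

// Whole line written as one string and positioned once.
$simple = 'BT /F1 12 Tf 72 700 Td (Hello World) Tj ET';
// The same kind of text split into fragments with individual position adjustments.
$fragmented = 'BT /F1 12 Tf 72 700 Td [(H) -20 (el) 15 (lo)] TJ ET';

echo extractShownText($simple), "\n";
echo extractShownText($fragmented), "\n";
```

In the fragmented case the numbers between the strings are position adjustments, so a real parser would also have to decide when a gap is large enough to count as a space.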
The up side here, is that the PDF format in itself is really simple. You have "objects" that follow a regular syntax. Then you can link them together to generate the content. The specification does a good job at describing the file format. But real world reading is going to take a bit of brain power...
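As a rough illustration of that regular syntax, the sketch below lists the indirect objects ("N G obj ... endobj") found in a raw PDF string. The function name findIndirectObjects and the sample data are invented for this example. It assumes uncompressed objects and can be fooled by binary stream data that happens to contain "endobj", so it is a starting point only.

```php
<?php
// Hypothetical helper: map "objectNumber generation" => object body.
// Sketch only: ignores object streams and cross-reference tables.
function findIndirectObjects(string $pdf): array
{
    $objects = [];
    // object number, generation number, "obj", body, "endobj"
    preg_match_all('/(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b/s', $pdf, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $objects[$m[1] . ' ' . $m[2]] = trim($m[3]);
    }
    return $objects;
}

// Two tiny objects; "2 0 R" in the first one is a reference that links to the second.
$sample = "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
        . "2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n";

print_r(array_keys(findIndirectObjects($sample)));
```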
Some helpful pieces of advice that I had to learn the hard way if you're going to write it yourself:
- Adobe likes to re-map fonts. So character 65 will likely not be A... You need to find a map object and deduce what it's doing based upon what characters are in there. And it is efficient since if a character doesn't appear in the document for that font, it doesn't include it (which makes life difficult if you try to programmatically edit a PDF)...
- Write it as abstract as possible. Write classes for each object type, and each native type (strings, numbers, etc). Let those classes parse for you. There will be a fair bit of repetition in there, but you'll save yourself in the end when you realize that you need to tweak something for only one specific type...
- Write for a specific version or two of the PDF spec, and enforce it. Check the version number, and if it's higher than you expect, bail... And don't try to "make it work". If you want to support newer versions, break out the specification and upgrade the parser from there. Don't try to trial and error your way up (it's not fun)...
- Good luck with compressed streams. I've found that typically you can't trust the length arguments to verify what you are uncompressing. Sometimes (for some generators) it works well... Others it's off by one or more bytes. I just attempt to deflate it if the filter matches, and then force the length...
- When testing lengths, don't use strlen. Use mb_strlen($string, '8bit') since it will compensate for different character sets (and allow potentially invalid characters in other charsets).
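Two of the points above (enforcing a version, and not trusting the stream length) could be sketched roughly as follows. The function names and the default maximum version of 1.4 are assumptions made for this example, not part of the answer. The stream helper assumes the filter is FlateDecode and that the bytes between "stream" and "endstream" have already been cut out.

```php
<?php
// Read the version from the "%PDF-x.y" header; null when the header is missing.
function pdfVersion(string $pdf): ?string
{
    return preg_match('/^%PDF-(\d+\.\d+)/', $pdf, $m) ? $m[1] : null;
}

// Bail out on anything newer than the parser was written for (1.4 is an arbitrary example).
function assertSupportedVersion(string $pdf, string $max = '1.4'): void
{
    $version = pdfVersion($pdf);
    if ($version === null || version_compare($version, $max, '>')) {
        throw new RuntimeException('Unsupported PDF version: ' . ($version ?? 'unknown'));
    }
}

// Try to inflate a FlateDecode stream body without relying on the declared /Length.
function inflateStream(string $raw): ?string
{
    // First try the bytes as captured, then retry without a trailing end-of-line
    // that may have been picked up before "endstream".
    foreach ([$raw, rtrim($raw, "\r\n")] as $candidate) {
        $out = @gzuncompress($candidate);
        if ($out !== false) {
            return $out;
        }
    }
    return null; // not zlib data, or damaged beyond this simple retry
}
```

Returning null instead of throwing keeps the caller free to skip a stream it cannot read and carry on with the rest of the document.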
Otherwise, best of luck...
Answered by Timo Haberkern
I use PDFBox for that (http://pdfbox.apache.org/). This software is Java-based and platform independent. It works fast and reliably. You can use it via exec or shell execute, or via a PHP/Java-Bridge (http://php-java-bridge.sourceforge.net/).
Answered by ryanday
Have you already looked at xPDF? There is a program in there called pdftotext that will do the conversion. You can call it from PHP and then read in the text version of the PDF. You will need to have the ability to run exec() or system() from php, so this may not work on all hosted solutions though.
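A minimal sketch of that approach, assuming pdftotext is installed and on the PATH, and that exec() is allowed. The function names pdfToText and textToRows are invented here, and splitting on runs of two or more spaces is only a heuristic for turning aligned columns into an array; it will break on cells that contain wide gaps or on columns that touch.

```php
<?php
// Run pdftotext with -layout (keeps columns aligned) and send the text to stdout ("-").
function pdfToText(string $pdfPath): string
{
    $cmd = 'pdftotext -layout ' . escapeshellarg($pdfPath) . ' -';
    exec($cmd, $lines, $status);
    if ($status !== 0) {
        throw new RuntimeException("pdftotext exited with status $status");
    }
    return implode("\n", $lines);
}

// Hypothetical heuristic: treat runs of two or more spaces as column separators.
function textToRows(string $text): array
{
    $rows = [];
    foreach (preg_split('/\R/', $text) as $line) {
        if (trim($line) === '') {
            continue; // skip blank lines between rows
        }
        $rows[] = preg_split('/\s{2,}/', trim($line));
    }
    return $rows;
}

// Usage (the file name is a placeholder):
// $rows = textToRows(pdfToText('report.pdf'));
```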
Also, there are some examples on the PHP site that will convert PDF to text, although it's pretty rough. You may want to try some of those examples as well. On that PHP page, search for luc at phpt dot org.
Answered by Mark Redman
Have a look at GhostScript or ITextSharp; there are various cross-platform versions of both.
Answered by Bill Karwin
Zend_Pdf is part of the Zend Framework. Their manual states:
The Zend_Pdf component is a PDF (Portable Document Format) manipulation engine. It can load, create, modify and save documents. Thus it can help any PHP application dynamically create PDF documents by modifying existing documents or generating new ones from scratch.
Answered by mark stephens
It may not actually be a table inside the PDF as the PDF loses that sort of information...

